Can we perform the text search present in the images or pdf files through elasticsearch


Prashant Agrawal
Hi ES users,

Is there any way to search, through Elasticsearch, for text that appears inside image or PDF files?

For example, suppose I have a PDF or image file indexed in ES (stored in base64 format), and the file contains the text "prashant". Is there a way to search for "prashant" and get back the record for that file as well?

Re: Can we perform the text search present in the images or pdf files through elasticsearch

Rafał Kuć-3
Hello!

Please look at the attachment plugin for Elasticsearch: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html

It uses Apache Tika under the hood. The list of supported formats is
available here: http://tika.apache.org/0.10/formats.html
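As a rough illustration of what declaring such a field might look like, here is a minimal sketch that only builds the mapping body in Python. The type name `document` and the field name `file` are placeholders invented for this example; `attachment` is the field type the plugin registers. Check the plugin documentation for the options supported by your version.

```python
import json

# Hypothetical names: the type "document" and the field "file" are
# placeholders chosen for this sketch, not taken from the thread.
DOC_TYPE = "document"
ATTACHMENT_FIELD = "file"


def build_attachment_mapping(doc_type=DOC_TYPE, field=ATTACHMENT_FIELD):
    """Return a mapping body that declares `field` as an attachment."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # "attachment" is the field type registered by the
                    # mapper-attachments plugin.
                    field: {"type": "attachment"},
                }
            }
        }
    }


if __name__ == "__main__":
    # This body would be sent when creating the index.
    print(json.dumps(build_attachment_mapping(), indent=2))
```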

--
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


> --
> View this message in context:
> http://elasticsearch-users.115913.n3.nabble.com/Can-we-perform-the-text-search-presnet-in-the-images-or-pdf-files-through-elasticsearch-tp4054367.html
> Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/588849345.20140418080555%40alud.com.pl.
For more options, visit https://groups.google.com/d/optout.

Re: Can we perform the text search present in the images or pdf files through elasticsearch

Prashant Agrawal
Hi,

If I am not mistaken, you are talking about https://github.com/elasticsearch/elasticsearch-mapper-attachments

With this plugin I can index attachments (say, a PDF file), and they will be stored base64-encoded. Does the plugin also make the text inside the PDF file searchable?

If so, what is returned when I search for a keyword in an attachment: the readable text or the base64-encoded data?

~Prashant

Re: Can we perform the text search present in the images or pdf files through elasticsearch

Rafał Kuć-3
Hello!

You'll need to send the file contents to Elasticsearch in base64 form,
and Elasticsearch will use Tika to extract data from the file.

However, in a typical case you would store not the whole binary file
(as it can be quite big), but rather a path to the file, so that the
application that queries Elasticsearch knows where to look for the
original file itself.
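A minimal sketch of preparing such a document in Python follows. It base64-encodes the file for extraction and records the file's path alongside it. The field names `file` and `path` are assumptions made for this example, and whether the base64 payload is kept in the stored document is a separate mapping decision that this sketch does not cover.

```python
import base64
import json
import os
import tempfile


def build_document(file_path, field="file"):
    """Build an index request body from a file on disk.

    `field` (hypothetical name) carries the base64 payload that the
    attachment field expects; "path" (also hypothetical) records where
    the original file lives so the application can fetch it later.
    """
    with open(file_path, "rb") as handle:
        encoded = base64.b64encode(handle.read()).decode("ascii")
    return {
        field: encoded,
        "path": os.path.abspath(file_path),
    }


if __name__ == "__main__":
    # Demonstrate with a throwaway file instead of a real PDF.
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
        tmp.write(b"mid-range 4G LTE market")
    try:
        print(json.dumps(build_document(tmp.name), indent=2))
    finally:
        os.remove(tmp.name)
```

The resulting dictionary is what would be serialized as the JSON body of the index request.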

--
Regards,
 Rafał Kuć



Re: Can we perform the text search present in the images or pdf files through elasticsearch

Prashant Agrawal
So can I say that the mapper-attachments plugin works as follows: whether I send a text file, a PDF file, or an image file to ES, the plugin will extract the text content in all three cases and store it in ES, so that it is available for search as well?

Re: Can we perform the text search present in the images or pdf files through elasticsearch

Rafał Kuć-3
Hello!

The attachment plugin will use Tika to extract the text from the binary
file content that you send in base64. Tika does a good job with text
extraction; however, you have to test for yourself whether your files
are parsed well enough for your use case.
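A search against the extracted text could then be expressed as an ordinary match query. The sketch below only builds the request body in Python. The field name `file` is an assumed placeholder, and depending on the plugin version the extracted text may have to be addressed through a sub-field such as `file.content` instead, so treat the field path as something to verify against your own setup.

```python
import json


def build_match_query(text, field="file"):
    """Return a search body matching `text` against `field`.

    `field` is a hypothetical attachment field name; adjust it to the
    path under which your plugin version exposes the extracted text.
    """
    return {"query": {"match": {field: text}}}


if __name__ == "__main__":
    # Body for a search request looking for the word "LTE".
    print(json.dumps(build_match_query("LTE"), indent=2))
```

By default each hit carries the document's _source, that is, the JSON that was sent at index time, so the base64 string comes back in the response rather than the extracted text.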

--
Regards,
 Rafał Kuć



Re: Can we perform the text search present in the images or pdf files through elasticsearch

Prashant Agrawal
Hi Rafał Kuć,
I tried this, but I didn't get the result I wanted.
Here is the problem in detail:

1) I have a PDF file containing the text "There is already a big market for mid-range 4G LTE market, being pushed by telecom operators and device manufacturers."

2) I indexed this file in ES, and when I checked it in ES, the content appeared to be in Unicode, like "PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwiDQp4bWxuczpvPSJ1cm46c2NoZW1hHAtZXF1aXY9Q"

3) So if I search for "LTE", it won't return any result, because the content stored in ES is in Unicode format.

So my question is: is there any way, or any plugin, to store the PDF content as a normal string so that I can search on it?

Re: Can we perform the text search present in the images or pdf files through elasticsearch

dadoonet
It should work with mapper attachment. Remember that what you see in _source is not what gets indexed.
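To illustrate the point about _source: the string quoted in the question is base64, which is the form in which the file was sent, not the text that was indexed. The short Python sketch below decodes the first 88 characters of that string (the rest of the quoted string appears truncated, so it is left out). The helper `looks_like_pdf` is a name made up for this example.

```python
import base64

# First 88 characters of the string quoted in the question. The
# remainder of the quoted string looks truncated and is omitted.
SOURCE_PREFIX = (
    "PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0"
    "LWNvbTp2bWwiDQp4bWxuczpvPSJ1cm46c2NoZW1h"
)


def looks_like_pdf(raw_bytes):
    """Return True if the bytes start with the PDF magic marker."""
    return raw_bytes.startswith(b"%PDF")


decoded = base64.b64decode(SOURCE_PREFIX)

if __name__ == "__main__":
    # Show the raw bytes that were actually sent to Elasticsearch.
    print(decoded)
    print("PDF header present:", looks_like_pdf(decoded))
```

The decoded bytes begin with HTML-style markup rather than a PDF header, which may mean the file that was sent was not a raw PDF. That is worth checking before concluding that extraction failed.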


About extracting and storing text content, fsriver does it. See https://github.com/dadoonet/fsriver#generated-fields

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


Written in reply to Prashant Agrawal <[hidden email]>, 22 Apr 2014 at 08:44.


Re: Can we perform the text search present in the images or pdf files through elasticsearch

ajoe
This post has NOT been accepted by the mailing list yet.
In reply to this post by Prashant Agrawal
@Prashant Agrawal Have you found a solution to the question you asked? I am having the same issue. Can you please help?