Indexing large pdf document


Indexing large pdf document

Jakko Sikkar
Hi,

I'm trying to index a big document with ES and the Mapper Attachment plugin (https://github.com/elastic/elasticsearch-mapper-attachments). The document has 719 pages, but after indexing I can only search for phrases up to page 33. When I index the document, I base64-encode the file contents, and the file is added to the index successfully. Is there a limit on the size of the file?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/04dd35e4-1caf-4a30-8f24-13cf47907067%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Indexing large pdf document

dadoonet
There is a limit on the number of extracted characters.

See https://github.com/elastic/elasticsearch-mapper-attachments#indexed-characters
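As a rough illustration of working around that limit: the plugin README describes a default of 100,000 extracted characters, which can be changed index-wide with the `index.mapping.attachment.indexed_chars` setting or per document with an `_indexed_chars` key, where -1 means no limit. A cutoff near page 33 would be consistent with that default at roughly 3,000 characters per page. The sketch below only builds the request body; the field name `file` is hypothetical, and the `_content` key name is an assumption that should be checked against the plugin version in use.

```python
import base64
import json


def build_attachment_doc(pdf_bytes: bytes, field: str = "file",
                         indexed_chars: int = -1) -> dict:
    """Build the JSON body for a document with one attachment field.

    `field` is a hypothetical name; it must match a field mapped with
    type "attachment" in the target index.
    """
    return {
        field: {
            # Per-document override of the extraction limit. -1 asks the
            # plugin to extract all text, which must then fit in memory.
            "_indexed_chars": indexed_chars,
            # The file contents, base64-encoded as in the original post.
            # Assumption: the key is "_content" in the plugin version used.
            "_content": base64.b64encode(pdf_bytes).decode("ascii"),
        }
    }


# Stand-in bytes so the sketch runs without a real PDF on disk.
doc = build_attachment_doc(b"%PDF-1.4 placeholder bytes")
body = json.dumps(doc)
```

The resulting `body` would be sent as the document source in the usual index request. Extracting all text from a 719-page file can use a lot of heap, so a large finite value for `indexed_chars` may be a safer choice than -1.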



-- 
David Pilato - Developer | Evangelist 





On 26 March 2015 at 10:51, Jakko Sikkar <[hidden email]> wrote:

Hi,

I'm trying to index a big document with ES and the Mapper Attachment plugin (https://github.com/elastic/elasticsearch-mapper-attachments). The document has 719 pages, but after indexing I can only search for phrases up to page 33. When I index the document, I base64-encode the file contents, and the file is added to the index successfully. Is there a limit on the size of the file?


Re: Indexing large pdf document

Jakko Sikkar
Thank you very much for pointing that out. I read the documentation but somehow skipped that part :)


On Thursday, 26 March 2015 at 12:51:50 UTC+2, David Pilato wrote:
There is a limit on the number of extracted characters.

See https://github.com/elastic/elasticsearch-mapper-attachments#indexed-characters


-- 
David Pilato - Developer | Evangelist
elastic.co
@dadoonet | @elasticsearchfr | @scrutmydocs




