Analyzers and JSON


Analyzers and JSON

Austin Harmon
Hello,

I'm trying to get an understanding of how to have full-text search on a document, with the body of the document considered during search. I understand how to do the mapping and use analyzers, but what I don't understand is how they get the body of the document. If your fields are file name, file size, file path, and file type, how do the analyzers get the body of the document? Surely you wouldn't have to put the body of every document into the JSON; that is how I've seen it done in all the examples, but it doesn't seem to make sense for large-scale production environments. If someone could give me some insight into how this process works, it would be greatly appreciated.

Thank you,
Austin Harmon

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f7349743-2fe5-41a3-b74f-22449a9b0197%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Analyzers and JSON

Aaron Mefford
Yes, you need to include all the text you want indexed and searchable as part of the JSON.

How else would you expect Elasticsearch to receive the data?

Regarding large-scale production environments: this is why Elasticsearch scales out.
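As a minimal sketch, a document like the one described might look like this before indexing (the field names, values, and endpoint URL are illustrative, not a prescribed schema):

```python
import json

# Illustrative document: the extracted body text is included in the JSON
# alongside the file metadata, so the analyzers have something to work on.
doc = {
    "file_name": "report.pdf",
    "file_size": 48213,
    "file_path": "/archive/2015/report.pdf",
    "file_type": "application/pdf",
    "body": "Full extracted text of the document goes here ...",
}

# This is the payload you would send to an index endpoint, e.g.
# PUT http://localhost:9200/files/doc/1 (URL and index name are made up).
payload = json.dumps(doc)
print(payload)
```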

Aaron


Re: Analyzers and JSON

Austin Harmon
Okay, so I have a large amount of data, about 2 TB, and it's all Microsoft Office documents, PDFs, and emails. What is the best way to index the body of these documents, i.e., make their contents searchable? I tried the PHP client, but that isn't helping. I know there are ways to convert files in PHP, but is there nothing available that takes in these types of documents? I tried the file_get_contents function in PHP, but it only reads plain-text files. Also, would you know of a good tool or method to make the matched files downloadable?

Thanks,
Austin


Re: Analyzers and JSON

Aaron Mefford
Take a look at Apache Tika (http://tika.apache.org/). It will let you extract the contents of the documents for indexing; this step is outside the scope of Elasticsearch itself. A good tool for making these files downloadable is also out of scope, but I'll answer the part that is in scope: put the files somewhere they can be reached by a URL. Any web server is capable of this, though your needs may vary, and this isn't the list for those questions. Once you have a URL the document can be reached at, include it when you index the document so your search results can point to it.

I'm sure there are other options for extracting the contents of Word documents, but Apache Tika is one frequently used for this purpose.
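A rough sketch of such an external extraction step, written in Python (the Tika jar path and the download URL scheme are assumptions; a real pipeline would adjust both for its own install):

```python
import subprocess
from pathlib import Path

def extract_text(path: Path) -> str:
    """Extract plain text from a file. Plain-text files are read directly;
    for Office/PDF formats this shells out to the Apache Tika CLI app
    (the jar path below is hypothetical; requires Java and tika-app)."""
    if path.suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="replace")
    result = subprocess.run(
        ["java", "-jar", "tika-app.jar", "--text", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def build_doc(path: Path, base_url: str) -> dict:
    """Build the JSON source for one file, including a download URL that
    any web server can serve, so search results can link back to the file."""
    return {
        "file_name": path.name,
        "file_size": path.stat().st_size,
        "body": extract_text(path),
        "download_url": f"{base_url}/{path.name}",
    }
```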


Re: Analyzers and JSON

Austin Harmon
Thank you for the information. I've been trying to use the mapper attachment plugin, which has Apache Tika built into it. I'm just surprised and confused that so many companies use Elasticsearch, yet it is so difficult to index the contents of a document. If I need to index document contents, would it be easier and more efficient to switch over to Apache Solr? As I said, I have 2 TB of data, so it isn't practical for me to manually prepare JSON for each document. If you have any experience with Solr, please let me know whether it would be a good solution to my problem.
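For context, the mapper-attachments approach looks roughly like this, if I'm reading the plugin docs right: you declare a field of type `attachment` in the mapping (index and field names below are illustrative) and then send the file's bytes base64-encoded inside the JSON, so the file content still travels through the JSON, just encoded:

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "file": { "type": "attachment" }
      }
    }
  }
}
```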

thanks,
Austin


Re: Analyzers and JSON

Aaron Mefford
You're going to have the same issue with Solr: putting the contents into XML, which is even heavier than JSON.

I wish I had more experience using Tika, but I do not. I'm aware of its capabilities but haven't had a reason to use it myself.

I see what you're saying about others not having the same issue, but keep in mind that most users are not indexing that type of document. They index events, database records, web pages, and so on. It's a very small subset that indexes things like Word docs and PDFs.


Re: Analyzers and JSON

Austin Harmon
Thank you for the information. I can tell this is going to be very difficult. Do you have experience with the mapper attachment plugin?


Re: Analyzers and JSON

Charlie Hull

Hi Austin,

Solr's SolrCell lets you submit documents in various formats directly to
Solr, which then uses Tika to extract the plain text for indexing.
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

However, we don't like this approach, as Tika itself can fall over (when faced with a great big complex PDF, for example; I've seen ones that run to 3,000 pages) or just eat up all the resources on your Solr server. So we tend to run Tika as part of an external indexing process, written in Python or Java, that then sends the plain text to Solr. We can then manage it, restart it, etc.
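The "manage it, restart it" point can be sketched as a driver loop that isolates extractor failures, so one bad document doesn't kill the whole run (`extract` and `submit` are injected callables here; this is a design sketch, not Flax's actual code):

```python
import logging

def index_batch(paths, extract, submit):
    """Run extraction outside the search engine: if the extractor chokes
    on one file (say, a 3000-page PDF), log it, skip it, and keep going,
    rather than letting it take the Solr/Elasticsearch server down."""
    indexed, failed = [], []
    for path in paths:
        try:
            text = extract(path)
        except Exception as exc:
            logging.warning("extraction failed for %s: %s", path, exc)
            failed.append(path)
            continue
        submit({"id": str(path), "body": text})
        indexed.append(path)
    return indexed, failed
```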

There are many other ways to do this as well of course - here's some
code that we wrote many moons ago which might be helpful:
https://code.google.com/p/flaxcode/source/browse/trunk/flax_filters/README

Cheers

Charlie



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/550311EF.2030008%40flax.co.uk.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

Aaron Mefford
In reply to this post by Austin Harmon
I'm not certain what you are referring to, so I expect not. I have used the Elasticsearch mappings, but I can't see how those would directly integrate with Tika.

On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[hidden email]> wrote:
Thank you for the information. This is going to be very difficult, I can tell. Do you have experience with the mapper attachment?

On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
You're going to have the same issue with Solr: putting the contents into XML, which is even heavier than JSON.

I wish I had more experience using Tika, but I do not. I am aware of its capabilities but have not had reason to use it myself.

I see what you are saying about others not having the same issue, but what you must realize is that most users are not indexing that type of document. They are indexing events, database records, web pages and so on. It is a very small subset that indexes things like Word docs and PDFs.

On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[hidden email]> wrote:
Thank you for the information. I've been trying to use the mapper attachment, which has Apache Tika built into it. I am just surprised and confused that so many companies use Elasticsearch, yet it is so difficult to index the contents of a document. If I need to index the contents of documents, would it be easier and more efficient to switch over to Apache Solr? As I said, I have 2 TB of data, so it isn't efficient for me to manually input each document so it can be indexed with specific JSON. If you have any experience with Solr, please let me know if it would be a good solution to my problem.

thanks,
Austin

On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
Take a look at Apache Tika http://tika.apache.org/. It will allow you to extract the contents of the documents for indexing; this is outside the scope of the Elasticsearch indexing. A good tool to make these files downloadable is also out of scope, but I'll answer what is in scope. You need to put the files somewhere they can be accessed by a URL. Any webserver is capable of this; of course your needs may vary, but this isn't the list for those questions. Once you have a URL that the document can be accessed by, include it when you index the document so that you can point to that URL in your search results.

I am sure there are other options out there for extracting the contents of word documents, Apache Tika is one that is frequently used for this purpose though.

On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[hidden email]> wrote:
Okay, so I have a large amount of data, 2 TB, and it's all Microsoft Office documents, PDFs, and emails. What is the best way to go about indexing the body of these documents, i.e. making the contents of the documents searchable? I tried to use the PHP client but that isn't helping, and I know there are ways to convert files in PHP, but is there nothing available that takes in these types of documents? I tried the file_get_contents function in PHP, but it only takes in text documents. Also, would you know of a good tool or method to make the files that are searched downloadable?

Thanks,
Austin


Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

Austin Harmon
There is a plugin called mapper attachments: https://github.com/elastic/elasticsearch-mapper-attachments. This plugin is supposed to use Tika to index the content of documents, but it doesn't seem to be working correctly. I base64-encode the documents, but the field comes back as null when I decode it.
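For what it's worth, the encoding step can be sanity-checked in isolation before involving the plugin. A minimal Python sketch follows; the field name `file` is an illustrative assumption, not taken from the plugin's docs:

```python
import base64

def attachment_field(data):
    # mapper-attachments expects the raw file bytes, base64-encoded,
    # inside an ordinary JSON string field.
    return base64.b64encode(data).decode("ascii")

def attachment_doc(path):
    # Read the file in binary mode; text mode would corrupt PDFs and
    # Office files before they ever reach the encoder.
    with open(path, "rb") as f:
        return {"file": attachment_field(f.read())}
```

If the base64 round-trip works here but the indexed field still comes back null, the problem is more likely in the mapping or in how the document is submitted than in the encoding itself.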
Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

Aaron Mefford
Have you looked at the StandAloneRunner included with that plugin?  

I would practice with that: see first if it can extract the content, then see if it can extract the content from your base64-encoded version of the document. Once that is working, you should be able to do what you are hoping.

However, while this plugin aims to make it easier, it does not make it more efficient. You have mentioned many times that you have a large number of documents to process, and it sounds like you think that by avoiding putting the contents of the document into the JSON you are being more efficient. Instead you have opted to put the entire document, base64-encoded, into the JSON, which is far less efficient.

Base64 encoding increases the size of a document by roughly a third. Depending on the document format, it may also be increasing the size of the actual text. However, if you use Tika to extract the text yourself (not via the plugin), put that text into the JSON, then gzip and post the JSON, that will be the optimal way to post your documents for indexing. It also gives you the greatest level of control and will allow you to use the bulk API.

One note is that Elasticsearch has a maximum HTTP POST size, controlled by the http.max_content_length setting (100 MB by default). If you are posting large documents you may exceed this, especially if you are using the bulk API.
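The chunking this implies can be sketched in Python. The index name, type name, and the 10 MB budget below are arbitrary assumptions; the real limit is whatever http.max_content_length is set to:

```python
import json

MAX_BYTES = 10 * 1024 * 1024  # stay well under http.max_content_length

def bulk_chunks(docs, index="docs", doc_type="doc", max_bytes=MAX_BYTES):
    """Yield newline-delimited bulk-API bodies (an action line followed by
    a source line per document), starting a new body before the current
    one would exceed max_bytes."""
    chunk, size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index,
                                       "_type": doc_type,
                                       "_id": doc_id}})
        line = action + "\n" + json.dumps(source) + "\n"
        nbytes = len(line.encode("utf-8"))
        if chunk and size + nbytes > max_bytes:
            yield "".join(chunk)       # flush the current bulk body
            chunk, size = [], 0
        chunk.append(line)
        size += nbytes
    if chunk:
        yield "".join(chunk)
```

Each yielded string is one POST body for the `_bulk` endpoint; sizing the bodies up front avoids ever tripping the server-side limit.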

If your concern is that you need to use PHP, then you do have an issue. This should be written in Java to fully leverage Tika. Writing it in Java will also allow you to leverage the Java node client API for writing to Elasticsearch. All this will make your loading far more efficient than trying to stay in PHP. If PHP is the only language you know, it might be time to learn another; you might find it is easier than what you have been doing so far. If I had a requirement to do this in PHP, after objecting to the requirement and explaining why it was the wrong way to do it, I would pursue alternatives to Tika that work in PHP. I see there are extractors for the .doc format, and .docx is XML in a zip file, so that can be extracted; there are other options. Worst case, you could call a command-line Tika to extract and then post using PHP, though this will be slow.

The real point is that in order for Elasticsearch to index your content, you need to show it the content. Putting that content into JSON is not only a good way to do that, it is the way it is done with Elasticsearch, so you should stop looking for an alternative. Even the plugin you are using will ultimately put the content into JSON and send it to Elasticsearch. This does not mean that you have to store the full content of the document in Elasticsearch; the mappings on your index can take care of that. It also does not mean that you have to retrieve the full content in your search results; your queries can take care of that if your mappings do not.
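The last point, indexing the content without storing or returning it, can be expressed in the mapping. Below is a sketch of an Elasticsearch 1.x-era mapping built as a Python dict; the type name and field names are illustrative assumptions:

```python
import json

# "content" is analyzed and searchable, but excluded from the stored
# _source, so the full extracted text is not returned with every hit.
MAPPING = {
    "mappings": {
        "doc": {
            "_source": {"excludes": ["content"]},
            "properties": {
                "content":  {"type": "string"},
                "filename": {"type": "string", "index": "not_analyzed"},
                "url":      {"type": "string", "index": "not_analyzed"},
            },
        }
    }
}

# This JSON body would be PUT to the index-creation endpoint.
MAPPING_JSON = json.dumps(MAPPING)
```

Alternatively, if the mapping does keep the content in _source, search requests can filter it out per query with _source filtering, which matches the "your queries can take care of that" remark.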



Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

dadoonet
In reply to this post by Austin Harmon
I'm a bit concerned about your "it does not work" statement.
1 bug and 3 feature requests.

Could you explain a bit more what is not working? Maybe I missed something.



-- 
David Pilato - Developer | Evangelist 




On 13 March 2015 at 10:49, Austin Harmon <[hidden email]> wrote:

There is a plugin called mapper attachments: https://github.com/elastic/elasticsearch-mapper-attachments. This plugin is supposed to use Tika to index the content of documents, but it doesn't seem to be working correctly. I base64-encode the documents, but the content comes back as null when I decode it.
On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
I'm not certain what you are referring to, so I expect not. I have used the elasticsearch mappings, but I can't see how those would directly integrate with Tika.

On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[hidden email]> wrote:
Thank you for the information. This is going to be very difficult, I can tell. Do you have experience with the mapper attachment?

On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
You're going to have the same issue with Solr: putting the contents into XML, which is even heavier than JSON.

I wish I had more experience using Tika; I do not. I am aware of its capabilities but have not had reason to use it myself.

I see what you are saying about others not having the same issue, but what you must realize is that most users are not indexing that type of document. They are indexing events, database records, web pages, and so on. It is a very small subset that indexes things like Word docs and PDFs.

On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[hidden email]> wrote:
Thank you for the information. I've been trying to use the mapper attachment, which has Apache Tika built into it. I am just surprised and confused that so many companies use elasticsearch and yet it is so difficult to index the contents of a document. If I need to index the contents of documents, would it be easier and more efficient to switch over to Apache Solr? As I said, I have 2 TB of data, so it isn't efficient for me to manually input each document so it can be indexed with specific JSON. If you have any experience with Solr, please let me know if it would be a good solution to my problem.

thanks,
Austin

On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
Take a look at Apache Tika http://tika.apache.org/.  It will allow you to extract the contents of the documents for indexing; this is outside the scope of the ElasticSearch indexing.  A good tool to make these files downloadable is also out of scope, but I'll answer what is in scope.  You need to put the files somewhere they can be accessed by a URL.  Any webserver is capable of this; of course your needs may vary, but this isn't the list for those questions.  Once you have a URL the document can be accessed by, include it in your indexing of the document so that you can point to that URL in your search results.

I am sure there are other options out there for extracting the contents of Word documents; Apache Tika is one that is frequently used for this purpose.

On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[hidden email]> wrote:
Okay, so I have a large amount of data, 2 TB, and it's all Microsoft Office documents, PDFs, and emails. What is the best way to go about indexing the body of these documents, i.e. making the contents of each document searchable? I tried to use the PHP client, but that isn't helping, and I know there are ways to convert files in PHP, but is there nothing available that takes in these types of documents? I tried the file_get_contents function in PHP, but it only takes in text documents. Also, would you know of a good tool or a method to make the files that are searched downloadable?

Thanks,
Austin



--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/169B3761-629C-471D-9E97-07EA75473F7E%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

Aaron Mefford
He posted limited details in a separate thread.

"mapper-attachment and base64 encoding"

I was not asserting that it does not work, just that it may not be the best way to handle "large number of documents".

I suspect there is an issue with encoding or submitting the document.




On Fri, Mar 13, 2015 at 1:35 PM, David Pilato <[hidden email]> wrote:
I’m a bit concerned about your « it does not work » statement.
1 bug and 3 feature requests.

Could you explain a bit more what is not working? Maybe I missed something.




--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAF9vEEqxZ9XPD8aB0jg3xartJEUW6NAKQGe%3D7Z8_udHsUQDijA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

dadoonet
Thanks. I missed the post.
Will answer there.

-- 
David Pilato - Developer | Evangelist 




On 13 March 2015 at 12:41, Aaron Mefford <[hidden email]> wrote:

He posted limited details in a separate thread.

"mapper-attachment and base64 encoding"

I was not asserting that it does not work, just that it may not be the best way to handle "large number of documents".

I suspect there is an issue with encoding or submitting the document.





--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5F337472-7F1B-462F-A9A2-A617D6F4536A%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

Austin Harmon
Hello,

I'm running an instance of elasticsearch 1.3.2 on Ubuntu Server 14.04 on an iMac. I have the mapper-attachments plugin installed, plus elasticsearch gui, which I'm using for my front end.

It's possible that I am missing something. Here is everything I've tried so far:

I got the mapper-attachments plugin installed.
Then I created the index with mapping:

curl -XPUT 'http://localhost:9200/historicdata' -d '{"mappings":{"docs":{"properties":{"content":{"type":"attachment"}}}}}'

Now I use a PHP script to take the documents and convert them and their contents to base64:

<?php

$root = '/home/aharmon/test';
 
 
$iters = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root),
    RecursiveIteratorIterator::CHILD_FIRST
);
try {
    foreach ($iters as $fullFileName => $iter) {
        $base64 = base64_encode($iter);
        $indexarray = array("File" => $base64);
        $jsonarray = json_encode($indexarray);
        file_put_contents("/home/aharmon/data.json", $jsonarray, FILE_APPEND);
    }
} catch (UnexpectedValueException $e) {
    printf("Directory [%s] contained a directory we can not recurse into", $root);
}


?>

Then I take my data.json file and use the bulk API:

{"index": {"_index": "historicdata", "_type": "docs" } } 
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRm"}
{"index": {"_index": "historicdata", "_type": "docs" } } 
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="} 
{"index": {"_index": "historicdata", "_type": "docs" } } 
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUueGxz"}
{"index": {"_index": "historicdata", "_type": "docs" } } 
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0FnZW5jaWVzIE1hc3RlciBMaXN0Lnhsc3g="}

This is in a separate folder called bulk-requests

Then I run this command:

curl -s -XPOST localhost:9200/_bulk --data-binary @bulk-requests; echo

I got a successful message back, so it is all indexed.

Then I run this command:

curl -XGET 'http://localhost:9200/historicdata/docs/_search' -d '{"fields": [ "content.content_type" ], "query":{"match":{"content.content_type":"text plain"}}}'

{"took":2, "timed_out":false,"_shards":{"total":5,"successful":5,"failed":0}, "hits":{"total":2,"max_score":1.0,"hits":[{"_index":"historicdata","_type":"docs","_id":"LMkqzKbyWTGffNtr1mGPZA","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRm"}}, {"_index":"historicdata","_type":"docs","_id":"GBEIWECwRgiUbYB6pnq7dQ","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="} }]}}

So it is indexing the documents and the search works, but the contents aren't being decoded from base64. Maybe there is some general rule with base64 that I don't know about? I have followed the documentation on GitHub and elasticsearch's site religiously. Also, when I decode the base64 within the PHP script before I put it into the JSON array, it all comes back as null. These are .xlsx, .xls, and .pdf documents.

Thanks for your help, guys; it is greatly appreciated.

Let me know if you need any more information than what I have provided.



--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c06948a0-5822-475e-9725-411fddaba903%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: Analyzers and JSON

Aaron Mefford
Well... I think I may see your issue.

I decoded this string:

L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM=

It is:

/home/aharmon/test/A Plus - Media Plan Summary.xls

Another is:
/home/aharmon/test/A Plus - Summary by Venue.pdf
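You can reproduce that decode from the shell, straight from one of the "File" values in your bulk file:

```shell
# Decode one of the indexed "File" values back to the original path
echo 'L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM=' | base64 -d
# → /home/aharmon/test/A Plus - Media Plan Summary.xls
```

In other words, what got indexed is just the path, not the document.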

I think you misunderstand the purpose or how this all fits together.

As I said, you must send the contents of the document to ElasticSearch for indexing. Sending the file name is not sufficient, unless you are just hoping to index the file name, but then why all the fuss with the Tika extension?

Your PHP code needs to read the full binary content of the xls, xlsx, or PDF, then base64-encode that full content. This will be a very large string, about 33% larger than the original file. This is done because base64 uses a safe character set that is acceptable in a JSON document, while the raw binary is not.
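Here is the difference in shell terms (a throwaway /tmp file stands in for one of your real documents):

```shell
# Throwaway file standing in for one of the .xls/.pdf documents
f=/tmp/example.txt
printf 'hello world' > "$f"

# What the PHP script does today: it encodes the PATH string
printf '%s' "$f" | base64        # → L3RtcC9leGFtcGxlLnR4dA==

# What it needs to do: encode the file's CONTENTS
base64 < "$f"                    # → aGVsbG8gd29ybGQ=
```

In PHP terms, that means passing file_get_contents($fullFileName) to base64_encode() instead of the iterator object, which stringifies to the path.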

With this understanding, perhaps you will now see why it has been suggested that this is not the ideal way to handle a large volume of documents. It will be more efficient to run Tika locally, build your JSON from the extracted text, compress it, and then send it to ES.
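A sketch of that pipeline (the tika-app jar location, file names, and field names are placeholders, and sending gzipped bodies requires compression support to be enabled on the ES node):

```shell
# 1) Extract plain text locally with Tika's command-line app (placeholder paths):
#    java -jar tika-app.jar --text '/home/aharmon/test/A Plus - Summary by Venue.pdf' > body.txt

# 2) Build a bulk body that carries the extracted text instead of base64 binary
printf '%s\n' \
  '{"index": {"_index": "historicdata", "_type": "docs"}}' \
  '{"file": "A Plus - Summary by Venue.pdf", "body": "extracted text goes here"}' \
  > /tmp/bulk-requests

# 3) Compress it before shipping; plain text compresses far better than base64 binary
gzip -c /tmp/bulk-requests > /tmp/bulk-requests.gz

# 4) Send it (commented out here):
#    curl -s -XPOST localhost:9200/_bulk -H 'Content-Encoding: gzip' --data-binary @/tmp/bulk-requests.gz; echo
```

With plain text in the body field you would map it as an ordinary string field rather than type attachment, so the mapper-attachments plugin drops out of the picture entirely.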



On Fri, Mar 13, 2015 at 3:05 PM, Austin Harmon <[hidden email]> wrote:
Hello,

I'm running an instance of elasticsearch 1.3.2 on Ubuntu Server 14.04 on an iMac. I have the mapper-attachments plugin installed, plus the elasticsearch GUI, which I'm using for my front end.

It's possible that I am missing something. Here are all the things I've tried so far:

I got the mapper-attachments plugin installed.
Then I created the index with mapping:

curl -XPUT 'http://localhost:9200/historicdata' -d '{"mappings":{"docs":{"properties":{"content":{"type":"attachment"}}}}}'

Now I use a PHP script to take the documents and convert the docs and contents to base64:

<?php

$root = '/home/aharmon/test';
 
 
$iters = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root),
    RecursiveIteratorIterator::CHILD_FIRST);

try {
    foreach ($iters as $fullFileName => $iter) {
        $base64 = base64_encode($iter);
        $indexarray = array("File" => $base64);
        $jsonarray = json_encode($indexarray);
        file_put_contents("/home/aharmon/data.json", $jsonarray, FILE_APPEND);
    }
}
catch (UnexpectedValueException $e) {
    printf("Directory [%s] contained a directory we can not recurse into", $root);
}


?>

Then I take my data.json file and run it through the bulk API:

{"index": {"_index": "historicdata", "_type": "docs" } } 
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRm"}
{"index": {"_index": "historicdata", "_type": "docs" } } 
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="} 
{"index": {"_index": "historicdata", "_type": "docs" } }
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUueGxz"}
{"index": {"_index": "historicdata", "_type": "docs" } }
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0FnZW5jaWVzIE1hc3RlciBMaXN0Lnhsc3g="}

This is in a separate folder called bulk-requests

Then I run this command:

curl -s -XPOST localhost:9200/_bulk --data-binary @bulk-requests; echo

I got a success message back, so it is all indexed.

Then I run this command:

curl -XGET 'http://localhost:9200/historicdata/docs/_search' -d '{"fields": [ "content.content_type" ], "query":{"match":{"content.content_type":"text plain"}}}'

{"took":2, "timed_out":false,"_shards":{"total":5,"successful":5,"failed":0}, "hits":{"total":2,"max_score":1.0,"hits":[{"_index":"historicdata","_type":"docs","_id":"LMkqzKbyWTGffNtr1mGPZA","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRm"}}, {"_index":"historicdata","_type":"docs","_id":"GBEIWECwRgiUbYB6pnq7dQ","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="} }]}}

So it is indexing the documents and the search works, but the contents aren't being decoded from base64. Maybe there is some base64 convention I don't know about that is assumed? I have followed the documentation religiously on GitHub and elasticsearch's site. Also, when I decode the base64 within the PHP script before I put it into the JSON array, it all says null. These are .xlsx, .xls, and .pdf documents.

Thanks for your help guys, It is greatly appreciated.

Let me know if you need any more information than what I have provided.




--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAF9vEEq7hGbOjpryy-j7ce%3Dw3KqY5UP75OB-2ab3TTMtFuKrTg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.