Document Processing

davrob
Hi,

After several years of using ES to speed up searches on relational database objects, I've been asked to quote on a project to index a 10TB store of documents, mostly held in Documentum.  By my rough calculations, if each document is on average 1 or 2 MB, then we have about 5 to 10 million docs.

I'm looking to get some insight into the decisions people have made:

1) ES Architecture - number of nodes, size of each node in terms of disk space, CPUs and RAM.
2) Indexing 
  i) Initial indexing of all docs - if I was doing this in my usual way, I would probably have a process on each node pulling a defined set of documents from Documentum, transforming them into JSON and then bulk indexing them into ES (a rough sketch of what I mean follows this list).  Is there a river anyone knows of?  To be honest, I haven't used rivers before because there weren't many around when I first used ES, so I wrote my own Java processes that read data, convert it to JSON and then bulk index it into ES.
  ii) Do you convert the entire document into JSON and push it to ES, or just take the first 10,000 words and assume anything else will be a repeat?
 iii) Indexing new docs as they are added to the repository - I guess this is where rivers help, but I could always code up something by hand, if necessary, to monitor Documentum and file systems for changes.
3) Partitioning/Routing - I don't think there is much I can do here, because the data is not held per user, or easily routed according to dates, but others' experience would be welcome.
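
For what it's worth, a rough sketch of the kind of loader I mean is below.  The Documentum fetch (fetchNextBatch), the index name ("docs") and the field names are placeholders, and a real loader would use a proper JSON library instead of hand-built strings; it just shows the batch -> JSON -> _bulk loop.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class BulkLoader {

    // Placeholder for whatever a Documentum query returns per document.
    static class Doc {
        String id;
        String title;
        String body; // extracted text, not the raw binary
    }

    // Placeholder: pull the next batch from Documentum (DQL/DFC would go here).
    // Returning an empty list means "done".
    static List<Doc> fetchNextBatch() {
        return Collections.<Doc>emptyList();
    }

    // Naive JSON escaping, good enough for a sketch; use Jackson or similar for real.
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
    }

    public static void main(String[] args) throws Exception {
        List<Doc> batch;
        while (!(batch = fetchNextBatch()).isEmpty()) {
            // One newline-delimited _bulk request: action line then source line per doc.
            StringBuilder bulk = new StringBuilder();
            for (Doc d : batch) {
                bulk.append("{\"index\":{\"_index\":\"docs\",\"_type\":\"doc\",\"_id\":\"")
                    .append(esc(d.id)).append("\"}}\n");
                bulk.append("{\"title\":\"").append(esc(d.title))
                    .append("\",\"body\":\"").append(esc(d.body)).append("\"}\n");
            }

            // POST the batch to one node's _bulk endpoint.
            HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/_bulk").openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(bulk.toString().getBytes("UTF-8"));
            out.close();
            System.out.println("bulk of " + batch.size() + " docs -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}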

thanks,

David.

Re: Document Processing

davrob
Bump - just in case someone has some experience they want to share.


Re: Document Processing

InquiringMind
David,

Well, I don't have a lot to add, but here are a few things I can think of:

1. You are writing your own bulk loader in Java. That's great! I've done the same thing, and it works very well. I can even add my own statistics, and show a bounded set of errors (for instance, if a thousand errors occur then showing just the first 100 or so is more than enough noise to get my attention!). These are things that tend not to be in the rivers. (There's a small sketch of the bounded-error idea after this list.)

2. ES has enough back-end threading to keep you from needing to do much threading on the bulk load side. For example, I bulk load two sets of data: One with 97 million small documents, and another with 78 million medium-sized documents (nothing as large as yours). I load them (and load subsequent update sets) serially in one thread. ES keeps itself, and the CPU and disk, plenty busy enough (even on a laptop with one relatively slow disk doing all the input reading and index writing). Starting with ES 0.90.0, I got my initial load times down to less than 3 hours for each of these sets... on that laptop!

3. ES version 0.90.3 is the latest stable release and you should move to it. Especially since you use Java, some methods were deprecated even from 0.90.0 to 0.90.3 so it's good to keep moving forward to minimize the breaking changes. Of course, this builds on #1 above: Writing my own loaders and query tools in Java means that I can rebuild the client side to match the ES version. Using 3rd-party rivers, you are sometimes locked into older versions of ES. Bah humbug to that!
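
To make the bounded-error point concrete, here's the shape of it as a sketch only (nothing ES-specific in it; the class and method names are made up). You'd call record() once for every failed item you see in a bulk response, then report() at the end of the load alongside your other statistics.

import java.util.ArrayList;
import java.util.List;

// Keep only the first maxKept failure messages and just count the rest.
public class BoundedErrors {
    private final int maxKept;
    private final List<String> kept = new ArrayList<String>();
    private long total = 0;

    public BoundedErrors(int maxKept) {
        this.maxKept = maxKept;
    }

    // Record one failure; keep its message only if we're still under the cap.
    public void record(String failureMessage) {
        total++;
        if (kept.size() < maxKept) {
            kept.add(failureMessage);
        }
    }

    // Print the kept messages plus a one-line summary of what was suppressed.
    public void report() {
        for (String msg : kept) {
            System.err.println("bulk failure: " + msg);
        }
        if (total > kept.size()) {
            System.err.println("... plus " + (total - kept.size()) + " more failures not shown");
        }
        System.err.println("total failures: " + total);
    }
}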

ES is fast, small, and flexible enough to experiment. Try loading increasingly large subsets of your Documentum store and plot the increase in time and disk space. Tweak and tune, and ask specific questions as you go along. Then you'll know for sure when you've hit the sweet spot for your full load and discovered the best strategy. For instance, your thoughts on creating the JSON for just the first N (10,000?) words and then creating a link to the full document (implied?) might be best, but if the update rate is relatively small then perhaps storing the entire thing won't be too bad. Just a guess though.
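
If you do go the first-N-words route, the truncation itself is trivial; something like this (illustrative only) before building the JSON source, with a link back to the full document in Documentum stored alongside it:

// Truncate extracted text to roughly the first maxWords whitespace-separated words.
public class FirstWords {

    static String firstWords(String text, int maxWords) {
        String[] words = text.trim().split("\\s+");
        if (words.length <= maxWords) {
            return text;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < maxWords; i++) {
            if (i > 0) sb.append(' ');
            sb.append(words[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstWords("one two three four five", 3)); // "one two three"
    }
}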

Brian
