Questions relating to elastic search

Karthik Shyamsunder
1) Can I use Elasticsearch as a data store? I keep hearing I can. I understand the classic issues, like the lack of transactions compared to a database, but that is not a concern for me. I would prefer to keep the document and the index together, rather than storing the data in Cassandra and the index in Elasticsearch. Some people say that keeping data and index together is an anti-pattern because it increases the size of the Lucene index, so you cannot scale the two separately. What do you think?

2) Assuming I can store data in Elasticsearch, does the data get stored with the index, meaning my index will be larger than if I did not store the data? If so, I could store only the index in Elasticsearch and go to Cassandra or some other NoSQL database to get the actual document. Is storing data in the index an anti-pattern?

3) I have five (5) dedicated machines, each 128GB, 16 core, 8x500GB. I would like to index 200 million documents with a total size of 10TB. In general, how many documents can I index? What is the best you have seen? There are about 50 fields, and one field is the concatenation of 10 web pages for a website. Are we talking 100, 200, 1000, 10000 ... docs per second per machine?

4) Related to question 3: currently we build the index in Hadoop and bring it into Solr (which is what we use today). It looks like we cannot do something like that in Elasticsearch. Is that right? In other words, do you have to use the API to index?

Please advise.
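
For a rough sense of scale, the figures in question 3 can be turned into back-of-the-envelope numbers. The sketch below uses only values stated in the question (200 million documents, 10TB, five machines, and the candidate per-machine rates). It assumes decimal units (1 TB = 10^12 bytes), a constant indexing rate, and load spread perfectly evenly across the machines, none of which a real cluster is likely to deliver.

```python
# Back-of-the-envelope estimate from the numbers in question 3.
# Assumptions (not from the thread): decimal units (1 TB = 10**12 bytes),
# load spread evenly over all machines, and a constant indexing rate.

TOTAL_DOCS = 200_000_000
TOTAL_BYTES = 10 * 10**12
MACHINES = 5


def average_doc_size_kb(total_bytes: int, total_docs: int) -> float:
    """Average document size in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / total_docs / 1000


def hours_to_index(total_docs: int, docs_per_sec_per_machine: int, machines: int) -> float:
    """Wall-clock hours to index everything at a steady per-machine rate."""
    cluster_rate = docs_per_sec_per_machine * machines
    return total_docs / cluster_rate / 3600


if __name__ == "__main__":
    size_kb = average_doc_size_kb(TOTAL_BYTES, TOTAL_DOCS)
    print(f"average document size: {size_kb:.0f} KB")
    # Candidate rates taken from the question itself.
    for rate in (100, 200, 1000, 10000):
        hours = hours_to_index(TOTAL_DOCS, rate, MACHINES)
        print(f"{rate:>6} docs/s per machine -> {hours:7.1f} hours")
```

Under these assumptions the average document is about 50 KB, and a full load takes roughly 111 hours at 100 docs/s per machine versus roughly 11 hours at 1000 docs/s per machine.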

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: Questions relating to elastic search

ppearcy
1) Yes, you can. A couple of years ago, I would have said no. That being said, I don't use elasticsearch as my primary data store since any loss of data is not acceptable and I want to be 100% sure.

2) Keeping the data co-located with the index likely makes things faster. Maybe with Lucene it was an anti-pattern, but with the distributed nature of elasticsearch I would no longer consider that the case.

3) It all depends on the data, index setup and hardware. Using the Java node client and the bulk APIs will help cut down on overhead. You'd really need to run tests to confirm. I don't think 1000/sec is unreasonable.

4) A lot of datastores have integrations/plugins available. I'm not familiar with doing this, but this seems to be an option: https://github.com/infochimps-labs/wonderdog

Best Regards,
Paul
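
To illustrate the bulk API mentioned in point 3: a bulk request body is newline-delimited JSON, with an action line followed by the document source for each document, and a trailing newline at the end. The sketch below only builds such a body as a string and does not contact a cluster. The index name `websites` and the `id` and `title` fields are hypothetical, chosen for illustration. Older Elasticsearch releases also expect a `_type` in the action line, so check the bulk API documentation for the version in use.

```python
import json
from typing import Iterable, Mapping


def build_bulk_body(index_name: str, docs: Iterable[Mapping]) -> str:
    """Build a newline-delimited body for a bulk indexing request.

    Each document contributes two lines: an action line naming the target
    index (plus an _id, if the document carries one under the hypothetical
    "id" key) and the document source itself.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name}}
        if "id" in doc:
            action["index"]["_id"] = str(doc["id"])
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk format requires the final line to end with a newline too.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample documents.
    sample = [
        {"id": 1, "title": "first site"},
        {"id": 2, "title": "second site"},
    ]
    print(build_bulk_body("websites", sample), end="")
```

A body like this would be sent to the `_bulk` endpoint in a single request, so the per-request overhead is paid once per batch rather than once per document.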

In reply to Karthik Shyamsunder's post of Friday, February 8, 2013 5:04:16 AM UTC-7.

Re: Questions relating to elastic search

Zachary Tong
In reply to this post by Karthik Shyamsunder
I'll weigh in on #3:

The common response is "it depends", because every situation is unique: doc size, number of docs, analyzer complexity, hardware, query pattern, etc. All the variables make it very difficult to predict with any accuracy what kind of performance to expect.

However, that answer is always terribly unsatisfying, so here are some personal results I've obtained from benchmarking.  
  • My cluster is 3 servers with 32 GB RAM, 8 cores and a software RAID1 disk setup, traditional 7200 rpm disks.
  • Index was [3 shard, 1 replica].
  • I could reliably hit 10,000-40,000 indexing requests/second using the non-bulk API.  I haven't played with the bulk API yet to see what kind of performance is possible.  Doc size, field count and analyzer complexity drastically altered indexing speed.
  • QPS (whatever it was for the test) remained relatively constant up to 100m docs, at which point I stopped the tests.  Max index size only hit 100 GB though, so you can see that the docs I was dealing with were still relatively small.
  • On my cluster, Disk I/O saturated before any other resource.
  • As soon as you start querying at the same time as indexing...these benchmarks go out the window.  Adding query overhead makes it a very different matter.
Hope that helps give you at least an idea of what's possible.  But honestly, it really depends and you should run some sample benchmarks yourself.

-Zach
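
As a starting point for the sample benchmarks suggested above, the sketch below times an arbitrary indexing call over a batch of documents and reports documents per second. It is a minimal harness, not the setup used for the results in this post: `index_fn` is a hypothetical placeholder for whatever client call actually sends a document to the cluster, and `fake_index` only simulates work so the example runs on its own.

```python
import time
from typing import Callable, Iterable, Mapping


def measure_indexing_rate(index_fn: Callable[[Mapping], None],
                          docs: Iterable[Mapping]) -> float:
    """Return the documents per second achieved by calling index_fn on each doc."""
    count = 0
    start = time.perf_counter()
    for doc in docs:
        index_fn(doc)
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0:
        return 0.0
    return count / elapsed if elapsed > 0 else float("inf")


def fake_index(doc: Mapping) -> None:
    """Stand-in for a real indexing call; it only sleeps briefly."""
    time.sleep(0.001)


if __name__ == "__main__":
    # Hypothetical sample documents; replace with representative real ones.
    sample_docs = [{"id": i, "body": "x" * 1000} for i in range(200)]
    rate = measure_indexing_rate(fake_index, sample_docs)
    print(f"{rate:.0f} docs/s")
```

For meaningful numbers, the documents should be representative in size and field count, and the run should be repeated with queries running at the same time, since query load changes the picture considerably.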


