Using ElasticSearch as Primary Data Store


Using ElasticSearch as Primary Data Store

vaidik
Hi Folks,

I am working on a project where we have the following specifications:
  1. Collect events at a rate of about 50-100 events/second, creating a new document for almost every event.
  2. We are going to query and perform multiple terms facets on that data.
  3. We are not going to be performing Full Text Searches.

I have played around with Elasticsearch and it serves the purpose from the application's point of view: it meets all our application requirements.

We would like to use just one data store where all the data resides. I know Elasticsearch can be used as a primary data store, but at this volume of documents I wonder whether anyone is using Elasticsearch as their primary data store without storing the data anywhere else for purposes like recovery, checking accuracy, etc.

I came across a case study on Elasticsearch.com which mentions that DataDog uses Elasticsearch as its only data store, having moved from Postgres to ES. But the case study doesn't go into the specific challenges: what one should be ready for when Elasticsearch is the single source of truth, and how to prepare for zero data loss if something goes wrong in the ES cluster. What scenarios must one be prepared for? What else?

Are there any stories around this use case that might help us? I would be glad to be pointed in the right direction or to get some advice.

Thanks,

Vaidik Kapoor
vaidikkapoor.info

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CACWtv5nnSxmadfDz3JQ%3De3KH%3D0n6xqN_qPCv1NjB%2BmjmKTnf4Q%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Re: Using ElasticSearch as Primary Data Store

Radu Gheorghe-2
Hello Vaidik,

I think ES is pretty safe to use as a primary data store: if a node goes down and you have replicas, the cluster will continue to function.

That said, if something goes seriously wrong in your cluster and your data becomes corrupted across all the replicas, or something similarly tragic happens, you can fall back on backups. I guess this holds for all data stores.

If you need backups, you can look at the snapshot-restore API which just came out with 1.0.0 beta 2.
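
As a rough illustration of the backup flow, here is a minimal sketch that builds the three HTTP requests involved: register a shared-filesystem repository, take a snapshot, and restore it. The host, the repository name `my_backup`, the snapshot name and the path are all assumptions made for the example, and the endpoints are the ones documented for the 1.0 snapshot-restore API, so check them against the version you run.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local test node


def register_repo_request(repo, location):
    """Build the request that registers a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return "PUT", "%s/_snapshot/%s" % (ES_URL, repo), body


def create_snapshot_request(repo, snapshot):
    """Build the request that snapshots the cluster into the repository."""
    url = "%s/_snapshot/%s/%s?wait_for_completion=true" % (ES_URL, repo, snapshot)
    return "PUT", url, None


def restore_snapshot_request(repo, snapshot):
    """Build the request that restores a previously taken snapshot."""
    url = "%s/_snapshot/%s/%s/_restore" % (ES_URL, repo, snapshot)
    return "POST", url, None


def send(method, url, body=None):
    """Send one request to the cluster and return the parsed JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the requests instead of sending them; call send(*request)
    # against a real cluster once the names and paths match your setup.
    for request in (
        register_repo_request("my_backup", "/mnt/es_backups"),
        create_snapshot_request("my_backup", "snapshot_1"),
        restore_snapshot_request("my_backup", "snapshot_1"),
    ):
        print(request)
```

The repository location has to be a path that every node can reach, which is why a shared mount is assumed here.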


(In reply to Vaidik Kapoor's message of Tue, Dec 10, 2013 at 2:28 PM.)

--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Using ElasticSearch as Primary Data Store

vaidik
Fair enough. That introduces me to a couple of things to explore.

I'll post more questions in this thread if and when I have more to ask or share.

Thanks for your answer. :)

Vaidik Kapoor
vaidikkapoor.info


(In reply to Radu Gheorghe's message of 10 December 2013 18:45.)

Re: Using ElasticSearch as Primary Data Store

Eugene Strokin
I have used ES as a primary data source since version 0.2. It has been in production for almost 2 years. We started with 5 shards all on the same node and now run 1 replica of those 5 shards across 3 nodes. It serves about a dozen requests per second on average, of all kinds: searches, filtering, sorting, faceting. I have even moved the whole cluster to different datacenters several times with zero downtime. All the problems I had were caused by something I did wrong; none were the fault of ES.
So I can say now that ES can be used as the only data store.
I've tried several other options: Solr (too hard to scale), Cassandra (not easy to support a complex data structure, and I wouldn't even call mine complex), and HBase on Hadoop (too low level). And the performance of ES is very impressive compared to the others.


Re: Using ElasticSearch as Primary Data Store

vaidik
For our use case, we need facets desperately. Otherwise we would have to do that in the application logic, which is not ideal and honestly a lot of work too. ES gives me that. However, given the number of documents we need to index (30-50 per second, and this number is going to grow with time), I wonder how people handle the following:

* Keeping the chance of data loss minimal. You cannot flush segments to disk very often, as that won't be optimal, and to my knowledge a lot of unoptimized segments will be created if I call the Flush API manually. So what does ES do when data has been written to the translog but the operations have not been flushed and the node goes down?
* If there is scope for data loss, what does one do to detect it?

Being new to ES, I am still trying to understand how ES uses the JVM heap and how that can affect my cluster. Consider this: we have three nodes and we are indexing data at the rate of 30-50 docs per second. When I started the cluster, JVM heap usage was low (about 2-4% on each node). Over time it keeps growing and stabilizes between 81% and 94%. Meanwhile I am only indexing, not querying the cluster at all. I am using G1 GC instead of CMS because CMS was giving me very long pauses (about 13-17 seconds per garbage collection), which is not ideal. With G1, collections are more frequent but quicker (so far I have seen about 1 second). This works, but I am concerned that heap usage is so high: what will happen if a little more load comes down the pipeline? Will ES be able to take it, or is there a chance of OutOfMemory errors taking the node down? Obviously this is something I will have to test for my own use case, but I would like to know whether anyone here has experienced similar problems and found a solution or a workaround.

After some time, GC happens so frequently that I can tell it is affecting indexing. I am indexing using a RabbitMQ consumer written in Python, and every 10-20 seconds I see a peak in the queue. That suggests the consumer cannot keep up, which in turn suggests it cannot write to ES quickly enough, so I assume GC is the cause of the slow writes, as the CPU is busy.

So:
* What are/could be the reasons for such heap usage? What is ES doing with so much heap?
* How can I keep it under control?

Thanks

Vaidik Kapoor
vaidikkapoor.info


(In reply to Eugene Strokin's message of 11 December 2013 08:22.)

Re: Using ElasticSearch as Primary Data Store

joergprante@gmail.com
ES picks up the translogs and replays them at the next start after a node has gone down.
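
To make the idea concrete, here is a toy model of a write-ahead log with replay on restart. It is only a sketch of the general technique, not how ES implements its translog; the class and method names are made up for the illustration.

```python
class ToyShard:
    """Toy write-ahead-log model; not Elasticsearch's actual code."""

    def __init__(self):
        self.translog = []   # durable in this model: survives a crash
        self.committed = {}  # durable: state as of the last flush
        self.in_memory = {}  # volatile: lost when the node crashes

    def index(self, doc_id, doc):
        # Record the operation in the log before applying it in memory.
        self.translog.append((doc_id, doc))
        self.in_memory[doc_id] = doc

    def flush(self):
        # Persist the in-memory state, after which the log can be dropped.
        self.committed.update(self.in_memory)
        self.in_memory.clear()
        self.translog.clear()

    def crash(self):
        # Everything that was only in memory is gone.
        self.in_memory.clear()

    def recover(self):
        # Replay every logged operation that had not been flushed yet.
        for doc_id, doc in self.translog:
            self.in_memory[doc_id] = doc

    def get(self, doc_id):
        if doc_id in self.in_memory:
            return self.in_memory[doc_id]
        return self.committed.get(doc_id)


if __name__ == "__main__":
    shard = ToyShard()
    shard.index("1", {"event": "login"})
    shard.flush()
    shard.index("2", {"event": "click"})  # logged but never flushed
    shard.crash()
    shard.recover()
    print(shard.get("1"), shard.get("2"))
```

In this model an unflushed document disappears at the crash and comes back after the replay, which is the behaviour you are asking about.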

You work heavily with facets, and I share your concerns about OOMs making the whole cluster flaky.

Have you checked how your cache on the heap is used? Note that some caches are turned on by default and may interfere with your facets.

You should also look into the new aggregation framework in 1.0.0.Beta2 to see whether the new faceting is less resource-consuming.
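
For comparison, a terms facet and the equivalent terms aggregation differ only slightly in the request body and in the shape of the response. A minimal sketch follows, where the field name `event_type` is a hypothetical example and the response shapes are what I would expect from the documented APIs:

```python
def terms_facet_body(field, size=10):
    """Search body that requests a terms facet on one field."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # only the counts are wanted, not the hits
        "facets": {field: {"terms": {"field": field, "size": size}}},
    }


def terms_agg_body(field, size=10):
    """Search body that requests the equivalent terms aggregation."""
    return {
        "query": {"match_all": {}},
        "size": 0,
        "aggs": {field: {"terms": {"field": field, "size": size}}},
    }


def facet_counts(response, name):
    """Turn a facet response into a {term: count} dictionary."""
    return {e["term"]: e["count"] for e in response["facets"][name]["terms"]}


def agg_counts(response, name):
    """Turn an aggregation response into a {term: count} dictionary."""
    buckets = response["aggregations"][name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


if __name__ == "__main__":
    print(terms_facet_body("event_type"))
    print(terms_agg_body("event_type"))
```

Normalizing both responses to the same dictionary makes it easier to run the two side by side and compare heap usage without touching the rest of the application.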

Regarding the indexing, check if you have a strategy for segment merging and throttling. Setting custom values can take much pressure off the heap, especially when segments grow very large.
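
As a hedged example of what such custom values could look like, the sketch below builds two settings bodies: one for store-level merge throttling and one for the tiered merge policy. The setting names are the ones I know from the 0.90/1.x documentation and the values are placeholders only, so verify both against your version before applying anything.

```python
import json


def throttle_settings_body(max_bytes_per_sec="20mb"):
    """Body for PUT /_cluster/settings that throttles merge I/O."""
    return {
        "transient": {
            "indices.store.throttle.type": "merge",
            "indices.store.throttle.max_bytes_per_sec": max_bytes_per_sec,
        }
    }


def merge_policy_body(max_merged_segment="2gb", segments_per_tier=10):
    """Body for PUT /<index>/_settings that tunes the tiered merge policy."""
    return {
        "index.merge.policy.max_merged_segment": max_merged_segment,
        "index.merge.policy.segments_per_tier": segments_per_tier,
    }


if __name__ == "__main__":
    # Illustrative values only; the right numbers depend on disks and load.
    print(json.dumps(throttle_settings_body("10mb"), indent=2))
    print(json.dumps(merge_policy_body("1gb", 8), indent=2))
```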

And finally, check whether you can identify the sweet spot for adding nodes if heap usage is just getting too high.
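
One way to watch for that point is to poll the node stats API and flag nodes above a threshold. The sketch below only does the arithmetic on an already parsed response. The response layout (`nodes.<id>.jvm.mem.heap_used_in_bytes` and `heap_max_in_bytes`) is an assumption that may differ between versions, and the 80% threshold is an arbitrary example.

```python
def heap_used_percent(node_stats):
    """Map node name to heap usage in percent.

    Assumes a parsed node-stats response of the form
    nodes.<id>.jvm.mem.{heap_used_in_bytes, heap_max_in_bytes}.
    """
    result = {}
    for node in node_stats["nodes"].values():
        mem = node["jvm"]["mem"]
        pct = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        result[node["name"]] = round(pct, 1)
    return result


def nodes_over(node_stats, threshold_percent):
    """Return the sorted names of nodes at or above the threshold."""
    usage = heap_used_percent(node_stats)
    return sorted(name for name, pct in usage.items() if pct >= threshold_percent)


if __name__ == "__main__":
    # Hand-written sample in the assumed layout, not real cluster output.
    sample = {
        "nodes": {
            "a": {"name": "node-1", "jvm": {"mem": {
                "heap_used_in_bytes": 850, "heap_max_in_bytes": 1000}}},
            "b": {"name": "node-2", "jvm": {"mem": {
                "heap_used_in_bytes": 300, "heap_max_in_bytes": 1000}}},
        }
    }
    print(heap_used_percent(sample))
    print(nodes_over(sample, 80))
```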

Jörg



(In reply to Vaidik Kapoor's message of Wed, Dec 11, 2013 at 10:03 AM.)

Re: Using ElasticSearch as Primary Data Store

davrob
Hi Jörg,

So if my index files get corrupted and I restore the cluster (using the new Snapshot API), is there a way of moving the translogs from the old cluster nodes to the new ones?

- David.

(In reply to Jörg Prante's message of Wednesday, 11 December 2013 11:10:46 UTC.)

Re: Using ElasticSearch as Primary Data Store

Matt Weber-2
For those of you concerned with OOM and heap issues due to facets and/or caching, I would consider waiting for 1.0, which will have:

* Field data circuit breaker to limit how much memory is used.
* Disk-based fielddata/docvalues

Doc values are in beta2; I imagine the circuit breaker will be in the next one. On top of that, a lot of work has gone into reducing the number of objects created, so there will be fewer GCs in general.
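
The circuit breaker idea itself is simple: estimate how much memory a load will need and refuse it up front if that would exceed a limit, rather than letting the node run out of heap. Here is a toy sketch of that general pattern. The names are invented for the example and this is not the ES implementation.

```python
class CircuitBreakingError(Exception):
    """Raised when a reservation would push usage past the limit."""


class ToyBreaker:
    """Toy memory circuit breaker; illustrates the pattern only."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        # Reject the load before it happens instead of failing with an OOM.
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                "would use %d of %d bytes"
                % (self.used_bytes + estimated_bytes, self.limit_bytes)
            )
        self.used_bytes += estimated_bytes

    def release(self, freed_bytes):
        # Called when loaded data is evicted or no longer needed.
        self.used_bytes = max(0, self.used_bytes - freed_bytes)


if __name__ == "__main__":
    breaker = ToyBreaker(limit_bytes=100)
    breaker.reserve(60)
    try:
        breaker.reserve(50)
    except CircuitBreakingError as err:
        print("rejected:", err)
    print(breaker.used_bytes)
```

A rejected request fails fast and leaves the accounting unchanged, so one expensive facet cannot take the whole node down.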

Thanks,
Matt Weber
 


(In reply to davrob2's message of Wed, Dec 11, 2013 at 8:22 AM.)