30 billion unique documents (and counting)


fdevillamil
Hi list,

I've been using ES in production since 0.17.6, with clusters of up to 64 virtual machines and 20 TB of data (including 3 replicas). We're now thinking about pushing things a bit further, and I wondered whether people here have similar experience / needs to ours.

Our current index holds 1.1 billion unique documents and 8 TB of data (including 1 replica) on 37 physical machines (32 data nodes, 3 master nodes and 2 nodes dedicated to HTTP requests), running ES 1.3 (an upgrade to 1.5 is already planned). We're indexing about 2,500 new documents per second and everything is fine so far.

Our goal is to index (and search) about 30 billion more documents (the backdata), plus about 200 million new documents each month.

Our company provides analytics dashboards to its clients, and they mostly browse their data on a monthly scale, so we route documents by month. Each shard ends up between 200 and 250 GB. The index is made of 128 shards, which covers about 10 years of data at 1 month per shard. Considering what we already have, we should reach 240 TB of data (and counting) with a single replica after we index all our backdata.
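
To give a concrete picture, here is roughly what that routing looks like at index and search time (ES 1.x; the index, type and field names below are simplified placeholders, not our real mapping):

    # All documents for a given month carry the same routing value,
    # so they end up on the same shard of the index.
    curl -XPOST 'http://localhost:9200/analytics/event?routing=2015-04' -d '{
      "client_id": 42,
      "visitor_id": "a1b2c3",
      "timestamp": "2015-04-22T10:00:00Z",
      "page": "/pricing"
    }'

    # Queries for that month pass the same routing value,
    # so only the matching shard is searched.
    curl -XPOST 'http://localhost:9200/analytics/_search?routing=2015-04' -d '{
      "query": { "match_all": {} }
    }'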

So, my questions here:

- Does anyone here have a similar use case / amount of data?

- Is ES the right technology for realtime, lightspeed queries (filtered queries and high-cardinality aggregations) on such an amount of data? (A sketch of the kind of query I mean follows this list.)

- What are the traps to avoid? Is it better to add lots of medium machines (12-core Xeon E5-1650 v2, 64 GB RAM, 1.8 TB 15k SAS hard drives) or a few huge machines with petabytes of RAM, terabytes of SSD and multiple ES processes?
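
Here is the query sketch mentioned above (ES 1.x syntax; field names are again placeholders): a filtered query scoped to one client and one month, with a cardinality aggregation on a high-cardinality field.

    # Count unique visitors for one client over one month.
    curl -XPOST 'http://localhost:9200/analytics/_search?routing=2015-04' -d '{
      "size": 0,
      "query": {
        "filtered": {
          "filter": {
            "bool": {
              "must": [
                { "term":  { "client_id": 42 } },
                { "range": { "timestamp": { "gte": "2015-04-01", "lt": "2015-05-01" } } }
              ]
            }
          }
        }
      },
      "aggs": {
        "unique_visitors": {
          "cardinality": { "field": "visitor_id", "precision_threshold": 1000 }
        }
      }
    }'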

Any feedback on a similar situation would be much appreciated.

Have a nice day,
Fred

Re: 30 billion unique documents (and counting)

Kimbro Staken
Hello Fred,

I have clusters as large as 200 billion documents / 130 TB. Sharing experiences on that would require a book, but a couple of quick things jumped out at me.

1. Do not go the huge-server route. Elasticsearch works best when you scale it horizontally. The 64 GB option is a much better choice.

2. If I understand correctly, you're routing an entire month's data to a single shard? By doing that you're directing all activity for that month to a single machine, or a small set of machines if you have replicas. That has to be much slower than something like a monthly index with a reasonable number of shards, which spreads the load across the cluster (see the sketch after this list). It also creates shard sizes that are fairly large, and if you have month-to-month variation in data rates you'll end up with "lumpy" shard sizes, which will definitely cause issues if you ever run your cluster low on disk space.

3. Get off ES 1.3 as fast as you can. 8 TB spread across 37 machines is very low density; as you push more data in, you don't want to be on ES 1.3.

4. If you're not already using doc_values, start looking into it now. Managing heap memory is, let's be nice and call it, "a challenge", and fielddata can eat heap in ways that will make your head spin.
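
To make 2. and 4. concrete, here is a rough sketch of a monthly index template with doc_values enabled (ES 1.x syntax; the index pattern, field names and shard count are placeholders to adapt, not a sizing recommendation):

    # Any index named events-YYYY.MM picks up this template, so each month
    # gets its own index spread over several shards, and the fields you
    # aggregate on are stored as doc_values instead of fielddata on the heap.
    curl -XPUT 'http://localhost:9200/_template/events_monthly' -d '{
      "template": "events-*",
      "settings": {
        "number_of_shards": 16,
        "number_of_replicas": 1
      },
      "mappings": {
        "event": {
          "properties": {
            "client_id":  { "type": "long",   "doc_values": true },
            "visitor_id": { "type": "string", "index": "not_analyzed", "doc_values": true },
            "timestamp":  { "type": "date",   "doc_values": true }
          }
        }
      }
    }'

You then index into events-2015.04, events-2015.05 and so on, and a dashboard covering several months simply searches a list of monthly indices or a wildcard like events-2015.*.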



Kimbro Staken



Re: 30 billion unique documents (and counting)

Mark Walkom-2
If you are using time-series data then you should be using time-series indices. As Kimbro pointed out, routing an entire month's worth of data to a single shard is not going to scale.

Also, we recommend that you keep shard size below 50 GB; this helps with recovery and distribution. There is also a hard limit of roughly 2 billion docs per shard in the underlying Lucene engine; if you hit it, you may lose data.
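
As a purely illustrative back-of-envelope using the numbers Fred gave:

    240 TB with 1 replica             ~ 120 TB of primary data
    120 TB at <= 50 GB per shard      ~ 2,400 or more primary shards
    ~120 monthly indices (10 years)   ~ 20 primary shards per monthly index

which is a very different layout from a single 128-shard index with one month routed to each shard.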


Re: 30 billion unique documents (and counting)

Jack Park-2
In reply to this post by Kimbro Staken
I would certainly like to see that book, or at least a draft of it ;-)


Re: 30 billion unique documents (and counting)

Jack Park-2
In reply to this post by Mark Walkom-2
That starts to argue for lots of smaller servers, maybe even with smaller SSDs. Say, a low-power i3 with 16 or 23 GB of RAM and a 128 GB SSD. Is that right?


Re: 30 billion unique documents (and counting)

Mark Walkom-2
Not really, the smaller server you mentioned before would be suitable.



Re: 30 billion unique documents (and counting)

Kimbro Staken
In reply to this post by Jack Park-2
Running ES at scale is all about balance and sizing things right. Like the three bears: not too big, not too small, just right. Big boxes will just be wasted, and boxes that are too small will have you hitting limits too soon. Given the way Java behaves with heaps above roughly 30 GB, the best size for a node right now seems to be around 64 GB of RAM. More RAM will be used by the OS cache, and that is definitely a good thing, but once you survey hardware options and consider other factors like power, rack density and per-node cost, big boxes don't balance out.
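
In practice that means something like the following on each 64 GB node (a sketch for the packaged ES 1.x install; the exact file depends on your distribution):

    # Keep the JVM heap below the ~32 GB compressed-oops threshold and leave
    # the rest of the 64 GB to the OS page cache for Lucene files.
    # e.g. in /etc/default/elasticsearch (Debian/Ubuntu) or
    # /etc/sysconfig/elasticsearch (RHEL/CentOS):
    ES_HEAP_SIZE=30g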

The 50 GB per shard number is just that, per shard. An individual node can (and should) hold dozens of shards. Larger shard sizes will work too, but when a node crashes, recovering a larger number of 50 GB shards is much faster than recovering a smaller number of 200 GB shards, especially in a large cluster.

Kimbro Staken


Re: 30 billion unique documents (and counting)

Alexandre Heimburger-2
Regarding one index per month, what about search performance when searching across 2 or 3 years?
