merge policy tuning hint

merge policy tuning hint

arta
Hi,
I think I am seeing node performance degrade when the merge count ("merges": {"current": x, ...} in node stats) increases to somewhere between 6 and 10, and that results in the node dropping from the cluster.
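For reference, a rough way to watch that counter from the command line (the URL below is the 0.19-era nodes-stats endpoint, and the host/port are placeholders, so treat this as a sketch and adjust for your setup):

# Poll the per-node merge stats once a minute and show the "merges" block
while true; do
  date
  curl -s 'http://localhost:9200/_cluster/nodes/stats' | python -mjson.tool | grep -A 1 '"merges"'
  sleep 60
done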

I enabled DEBUG logging for index.merge, and I see the following logs for each shard:

[2012-12-12 18:29:44,136][DEBUG][index.merge.policy       ] [Scourge of the Underworld] [i11][25] using [tiered] merge policy with expunge_deletes_allowed[10.0], floor_segment[2mb], max_merge_at_once[10], max_merge_at_once_explicit[30], max_merged_segment[5gb], segments_per_tier[10.0], reclaim_deletes_weight[2.0], async_merge[true]
[2012-12-12 18:29:44,136][DEBUG][index.merge.scheduler    ] [Scourge of the Underworld] [i11][25] using [concurrent] merge scheduler with max_thread_count[3]

I'm using the default merge policy; each node has 16 cores.
num_replicas is 1.
Each node has around 250 shards, including primary and replica shards.
ES is 0.19.3.

This morning I encountered a node drop, and according to the log the dropped node was doing many merges at the time. The time taken by each merge seemed to be increasing as the drop approached.
It started with relatively short times like this:

[2012-12-13 07:43:07,235][DEBUG][index.merge.scheduler    ] [Scourge of the Underworld] [i13][27] merge [_72z9] done, took [54.8s]
(the size of this shard is 4.2GB)

Then the time increased to minutes (7:50 is the time when this node dropped from the cluster):

[2012-12-13 07:50:30,396][DEBUG][index.merge.scheduler    ] [Scourge of the Underworld] [i15][3] merge [_16sc3] done, took [4.8m]
(the size of this shard is 5.8GB)

It kept increasing:

[2012-12-13 07:57:17,056][DEBUG][index.merge.scheduler    ] [Scourge of the Underworld] [i531][22] merge [_6293] done, took [11m]
(the size of this shard is 3.6GB)

I'm not sure whether this is related to my problem (the node drop), but my gut feeling is that I need some tuning on the merge policy.

Any hints or guidance would be very much appreciated.
Thank you in advance.

Re: merge policy tuning hint

Radu Gheorghe-2
Hello,

You can reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Another thing that might help is to lower the index.merge.scheduler.max_thread_count, especially since the default is based on the number of processors. And you have many of those :)

Documentation for all the above settings can be found here:
http://www.elasticsearch.org/guide/reference/index-modules/merge.html

250 shards per node is a lot. Can you bring that number down? How many indices do you have?

Another interesting piece of information: what's overloaded when a node drops (CPU, IO, etc.)? Is there anything interesting in the logs?

If it's CPU, lowering the number of concurrent merges and threads should help. If it's IO, you might additionally look at store level throttling, especially for the type "merge":
http://www.elasticsearch.org/guide/reference/index-modules/store.html
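For example, something along these lines (the index name "i11" is taken from the logs above and the values are only illustrative; whether these settings can be changed on a live index in 0.19.x, or only at index creation or in elasticsearch.yml, depends on the exact version, so treat this as a sketch):

# Fewer segments merged at once and fewer concurrent merge threads for one index
curl -XPUT 'http://localhost:9200/i11/_settings' -d '{
  "index.merge.policy.max_merge_at_once": 5,
  "index.merge.policy.max_merge_at_once_explicit": 15,
  "index.merge.scheduler.max_thread_count": 2
}'

# Store-level throttling for merges is a node-level setting (names as in the
# store module docs linked above; it may not exist on older 0.19.x releases),
# typically set in elasticsearch.yml:
#   indices.store.throttle.type: merge
#   indices.store.throttle.max_bytes_per_sec: 20mb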

Best regards,
Radu
--
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene


Re: merge policy tuning hint

Randall McRee
In reply to this post by arta
Our experience has been that when a node drops in an elastic cluster something pretty basic is wrong. 
For example, you have too little memory, too little CPU, not enough disc, or not enough IOPS to support your workload.

Radu courteously answered your original query about the segment merge policy, but I fear that is not your actual problem and will not really help you.

If you have 250 shards then you are putting tremendous pressure on the I/O system whenever you are indexing. Lucene's architecture is carefully designed to limit the amount of I/O on a *per* index basis. But by having 250 shards you are abusing this architecture, and this architecture is not working for you, I suspect. Tweaking those parameters will probably just change the time at which a node drops. Have you tried answering queries while indexing? This will put more pressure on your cluster.

Of course, the community cannot help you in any real way without knowing basics like:

What are you trying to accomplish?
How many items are you indexing and what is their basic structure?
How many nodes do you have in a cluster, how much memory, how much disc?
Why did you choose 250 shards? Our experience has been that the number of shards should be somewhere around the number of nodes for each index plus a factor of ~two (but not 200).

Hope this helps,
Randy


Re: merge policy tuning hint

arta
In reply to this post by Radu Gheorghe-2
Thank you Radu.
I did check the merge policy document.
My understanding by reading it is:
- max_merge_at_once: smaller value = less load while merging, but a longer merge period
- max_merged_segment: smaller value = less load while merging, but may slow down search
- segments_per_tier: smaller value = more load while merging, but faster search
- max_thread_count: default is 3 for my 16-core machine = max(1, min(3, 16/2))

It seems to me they all have pros and cons, so that's why I asked for experts' guidance.

As for the number of shards per node, what would be ideal? 2, 5, 10, 20, 50, or 100?

At the beginning I wanted to have a couple of million indices or more, because our documents can be cleanly partitioned into that many small groups and searches are done within each group.
But that did not sound realistic, so I ended up with fewer than 100 indices, combining a couple of dozen thousand of these groups into each index. To get search performance out of the independence of the groups, I use routing. To make the routing more effective, I wanted as many shards as possible, and after some research I came up with a number: 32 shards per index.
That is the history of how I ended up with about 250 shards per node.
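A sketch of that routing pattern (the index name, type, and group id below are placeholders): documents belonging to a group are indexed with a routing value, and searches for that group pass the same value, so only the shard holding that group is queried.

# Index a document for one group, routed by the group id
curl -XPUT 'http://localhost:9200/i11/doc/1?routing=group-12345' -d '{
  "group": "group-12345",
  "title": "example document"
}'

# Search within the same group; the routing value keeps the query on one shard
curl -XPOST 'http://localhost:9200/i11/doc/_search?routing=group-12345' -d '{
  "query": { "term": { "group": "group-12345" } }
}'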

Reading your feedback, I guess I made a bad decision and should have gone with a smaller number of shards.
But since the number of shards per index can't be changed after the index is created, I want to work with what I have.

Our usage is 50% get, 45% index, and 5% search.
Overall traffic varies, but it is approximately 50-60 gets/sec and 40-50 indexing requests/sec.
We have about 20 data nodes for about 5000 shards.

I experimentally changed the JVM heap size from 24GB to 16GB (the machine has 32GB) and the problem started to happen more frequently, so I changed it back to 24GB.
Perhaps I need more memory and/or more nodes, but if I can work around the problem by tweaking some parameters, that would be great in the short term.


Re: merge policy tuning hint

arta
In reply to this post by Randall McRee
Thank you for the input, Randall.
I replied to Radu's input above, and there I wrote the history behind the 250 shards.
You said:
> Our experience has been that the number of shards should be somewhere around the number of nodes for each index plus a factor of ~two (but not 200).
Does that mean, in our case ~80 indices with 32 shards and 1 replica, I should have 64 nodes?

Re: merge policy tuning hint

Randall McRee
Well, I am still not sure what the use case is that drives you to think you need so many indices. What I have done in the past is to simply have a doctype field which distinguishes the different types of things that are being indexed and put all items into the same index. Lucene handles this really well--it efficiently restricts itself to the subset of the indexing universe specified by whatever doctype has been specified as part of the query. This has the advantage that you can then write queries over multiple doctypes that would have required merging from separate indices. Although elastic makes this easier than a lucene-only system I can't help thinking that you would be better served with the one index approach.
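A minimal sketch of that single-index-plus-doctype pattern (the index name, field name, and values below are made up; the 0.19-era query DSL would use a filtered query wrapping a term filter on the doctype field):

curl -XPOST 'http://localhost:9200/catalog/_search' -d '{
  "query": {
    "filtered": {
      "query":  { "query_string": { "query": "some search terms" } },
      "filter": { "term": { "doctype": "invoice" } }
    }
  }
}'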

So yes, if you really need 80 indices I would try to have 32-64 nodes each hosting a small number of shards. There is no reason that the whole cluster needs to be involved in every query, right?

If you have found that you need 24GB of heap, that implies a host probably needs at least 48GB of memory, by the rule of thumb that no more than half of a machine's RAM should be devoted to the Elasticsearch JVM. I have also found this to be true: we are currently at 19GB for the JVM out of 54GB on the node.

But, again, by going to a few shards per index you can probably drastically reduce the heap size. That is what I think you should try to do.


Re: merge policy tuning hint

eshrago
(I'm a colleague of arta...)

RKM,

The reason behind the number of indices was to logically "tie" together the backends we were sourcing our indexing from. At this point we can't afford to restructure our cluster and reindex all of our data.

Looking at heap usage, although we have 24GB of heap allocated, committed usage seems to hover around 18GB. We also have a RAID flash cache, so any writes that go to disk hit the cache first; disk I/O hasn't really been much above 50-60%, even during our heaviest indexing.

When looking at a graph of our concurrent merges for the past month, the gradual increase (then sudden spike) of merge activity coincides with our node disconnects, so that's why we've been focused on trying to prevent the merge race condition from recurring: http://imgur.com/oWTw4

Does this provide any further clarity as to what may be happening here?


Re: merge policy tuning hint

Randall McRee
In reply to this post by arta
I'm still not sure what each node looks like: cores, memory, disc?

Can you succinctly summarize for the group what your cluster looks like?

I understand that for logical reasons you have many indices, but why does each index have 250 shards?

You saw that merging was happening around the time of the node drop--but can you tell us *why* it dropped exactly? Did it fail to respond to a heartbeat? Did the wrapper kill elasticsearch? There might be more than one problem here.

Thanks,
Randy


Re: merge policy tuning hint

eshrago
Sorry, just to clarify...each index has only 32 shards with one replica, but we have 250 shards per host.

Cluster in a nutshell:
Total combined index size: 23.4 TB

- 19 data nodes
- 79 indices
- 32 shards + 1 replica of each per index = 64 shards per index, 5056 total shards in the cluster
- 5056 shards / 19 data nodes = approx. 266 shards/node (we allocate shards evenly and randomly across the cluster, so any index will have shards that span most of the hosts in the cluster)

For each node:

16 cores (16xIntel(R) Xeon(R) CPU L5520 @ 2.27GHz)
32GB memory (8X 4096 MB/1333 MHz/DIMM/DDR3),

Local OS has 2x250GBx7200RPM in a mirrored RAID.

Separate RAID-10 for ES index data - Seagate 15K 450GB, total logical drive size 2.7TB

Leading up to the node drop, we did receive ping responses, but since they were past the timeout threshold we have set (120 seconds), they were disregarded and the node got removed due to a failed heartbeat. This had been happening for a couple of hours leading up to the node dropping, so likely a couple of ping requests would time out, but one would get through before it timed out:

[2012-12-10 07:14:52,411][WARN ][transport                ] [Farallah] Received response for a request that has timed out, sent [267929ms] ago, timed out [147928ms] ago, action [discovery/zen/fd/ping], node [[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id [56024100]

[2012-12-10 11:45:44,488][WARN ][transport                ] [Farallah] Received response for a request that has timed out, sent [193187ms] ago, timed out [73186ms] ago, action [discovery/zen/fd/ping], node [[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id [57862298]

[2012-12-10 11:51:45,528][INFO ][cluster.service          ] [Farallah] removed {[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]],}, reason: zen-disco-node_failed([Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]), reason failed to ping, tried [3] times, each with maximum [2m] timeout
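Those numbers correspond to the zen fault-detection settings; a hypothetical elasticsearch.yml excerpt that would produce "tried [3] times, each with maximum [2m] timeout" looks roughly like this (the file path and values are assumptions about this cluster's config):

cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
discovery.zen.fd.ping_timeout: 2m    # per-ping timeout seen in the removal message
discovery.zen.fd.ping_retries: 3     # "tried [3] times"
EOF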





Re: merge policy tuning hint

Otis Gospodnetic
Hi,

I didn't carefully read this thread, but that seems like a high number of shards/server.  Does a single query hit all of them?  If so, that's a biiiig red flag.

How did you come up with "32 shards" number?  Why not 22?  Or 19?

A quick way of testing whether you have too many shards and if that's the source of the problem is to create a test cluster somewhere, take something like a Wikipedia dump, index it into an index with a high number of shards and do a performance test while observing various performance metrics (see my signature).  Then drop the index and create a new one with half the shards and repeat.  This will give you the feel for how much impact a high number of shards has on perf.
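A rough sketch of that experiment (index names and shard counts below are made up; the same corpus and the same query load would be run against each index in turn):

# Create two otherwise identical test indices that differ only in shard count
curl -XPUT 'http://localhost:9200/shards-test-32' -d '{
  "settings": { "index": { "number_of_shards": 32, "number_of_replicas": 1 } }
}'
curl -XPUT 'http://localhost:9200/shards-test-16' -d '{
  "settings": { "index": { "number_of_shards": 16, "number_of_replicas": 1 } }
}'
# Bulk-index the same dump into each, run identical queries, and compare merge
# counts, CPU, and IO from the node stats while the load runs.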

Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html



Re: merge policy tuning hint

Jim Wong
I work with Art and Ethan. A little background: we didn't target a specific number of shards per server; instead we divided our user base into partitions, and broke each partition up into a fixed number of shards, with the built-in assumption that the indexes will get much, much bigger in the future.

No, no single query hits every shard.

Just to restate some of the problem: things perform fine for a few days after the cluster is started up. Then we notice the number of current merges (as reported by ES) starts to grow until finally nodes start dropping out and everything falls apart. We are not sure whether the current-merges metric is the cause of the problem or an effect.

The question we are wrestling with is: what changes between the first 2-3 days, when the number of merges seems relatively stable and the cluster is performing great, and the next several days, when the number of merges jumps up and everything falls apart?

During this time, heap usage in the JVM looks good and disk I/O looks fine - the utilization percentage reported by iostat is under 50%. CPU seems to increase with the number of active merges, but doesn't seem to max out.

Thoughts? We're actually in a period right now in which the number of merges has started to climb, but no one has dropped out of the cluster (yet). The pattern is consistent enough that we can look at the merge activity graphs and anticipate when the next failure will occur.



Re: merge policy tuning hint

Randall McRee
This would seem to be relevant:


Elasticsearch is simply building on top of Lucene. You can "tweak" the merge factors, but to get good results you will need to understand the extensive tradeoffs involved. However, it's a slight investment compared to what you have done so far.

Still, looking at your system from a high level, I believe that merge factors will be a band-aid, and the only real relief is to re-architect so that you have a "reasonable" number of shards per node. For example, if you really need 24GB of JVM heap, then you probably need (at least) 48GB of memory per node. Etc.


Re: merge policy tuning hint

Jim Wong
Thanks for the response. We'll check out that link.

Can you elaborate on where you think the bottleneck is that's leading to this problem?

Under normal circumstances (the first couple days after the cluster comes up), there don't seem to be any obvious resource constraints arising from the large number of shards: CPU utilization is very reasonable (perhaps around 30%), and I/O utilization seems reasonable, as well (under 40%). GC logging seems to indicate that we're not having issues on that front.  Everything seems fine and dandy for a couple days.

Over time, however, _something_ seems to degrade: the number of merges the system completes decreases dramatically, and the number of slow merges increases (based on the debug logs ES produces). This is an interesting period, because the cluster outwardly seems to be healthy, but we can tell from the metrics we're collecting that things are starting to go downhill. CPU and disk I/O still seem fine, but for some reason there seems to be a problem with merges.

Eventually - a few days after things start to slow down - the number of threads in use skyrockets, the heap gets completely consumed, and nodes start dropping out of the cluster. This is the end stage of the disease: we've hit a tipping point and are dead.

We have some numbers (and maybe graphs) we might be able to share that illustrate what's happening. To solve the problem, we'd like to understand what resource we're low on before we jump into re-architecting the cluster, lest we run into a similar problem again.

In the meantime, we are going to try adding another 32 GB of RAM to one of our nodes while leaving the heap size unchanged to see if it changes the pattern.  I'm a little pessimistic on this front because I/O doesn't seem to be a huge bottleneck right now (based on the output of iostat), but it does seem possible that just fractionally improving merge performance through additional buffer space could help if we are very slowly falling behind.



Re: merge policy tuning hint

Jim Wong
Sorry for the double-post, and apologies in advance if posting attachments to the group is verboten.

Here are graphs that show two values for the last week:

- merges-current.jpg shows the number of current merges as reported by ES. Note that this number stayed low after midday Friday (when we "fixed" things the last time) through sometime Sunday, when things started to go downhill.

- merges-total.jpg shows the rate of merges reported by ES, calculated by subtracting the total merges metric at repeating time intervals.
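A sketch of that delta calculation (the stats endpoint is the 0.19-era one, and the JSON field path is assumed from the "merges" snippet quoted in the original post):

# Sample cluster-wide merges.total every 5 minutes; the difference between
# samples is the number of merges completed in that interval
prev=0
while true; do
  total=$(curl -s 'http://localhost:9200/_cluster/nodes/stats' | python -c 'import json,sys; print(sum(n["indices"]["merges"]["total"] for n in json.load(sys.stdin)["nodes"].values()))')
  echo "$(date '+%F %T') merges completed since last sample: $((total - prev))"
  prev=$total
  sleep 300
done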

Lining these two graphs up, it looks like things are good Friday and Saturday: we're merging quite a bit, but not that many are active at any given time, which seems to indicate things are going fast.

Starting on Sunday, the number of concurrent merges starts climbing, and the total number of merges completed drops way off.  Our conclusion is that merges have started taking longer for some reason, which leads to more running at the same time.

You can disregard the changes at the end of the day Wednesday - that's when we started fixing things by restarting nodes. If the previous pattern holds, things will be OK again for a couple days.

We also have things set up with debug logging so that ES prints a message like this when a merge completes and takes more than 20 seconds:

[2012-12-15 12:59:36,520][DEBUG][index.merge.scheduler    ] [Matsu'o Tsurayaba] [i26][22] merge [_6jeg] done, took [1.3m]

Picking a node at random, here are the counts of slow merges on Saturday (healthy), Monday (unhealthy, but still working), and Wednesday (going down the drain):

Saturday: 7
Monday: 15
Wednesday: 22 

Based on this information, the number of slow merges seems to be increasing, while the total number of merges completed seems to be decreasing.
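
For completeness, here is a minimal sketch of how those per-day counts can be pulled out of the debug log automatically, assuming the "merge [...] done, took [...]" line format shown above; the log path and the 20-second threshold are placeholders rather than our real setup.

# Minimal sketch: count "slow" merges per day from the ES debug log shown above.
# The regex targets the "merge [_xxx] done, took [Ns]" / "[N.Nm]" format; the
# log path and 20-second threshold are placeholders.
import re
from collections import Counter

LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"  # placeholder
THRESHOLD_SECONDS = 20.0

# e.g. [2012-12-15 12:59:36,520][DEBUG][index.merge.scheduler ] ... done, took [1.3m]
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}) [^\]]+\]\[DEBUG\]\[index\.merge\.scheduler\s*\].*"
    r"merge \[[^\]]+\] done, took \[([\d.]+)(ms|s|m|h)\]"
)
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}

slow_per_day = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.search(line)
        if not match:
            continue
        day, value, unit = match.groups()
        if float(value) * UNIT_SECONDS[unit] > THRESHOLD_SECONDS:
            slow_per_day[day] += 1

for day in sorted(slow_per_day):
    print(day, slow_per_day[day])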

Still, I/O and CPU seem reasonably healthy:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          23.84    0.00    1.90    1.19    0.00   73.07

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.67  0.00  1.67     0.00     0.01    11.20     0.00    0.60   0.20   0.03
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.67  0.00  1.67     0.00     0.01    11.20     0.00    0.60   0.20   0.03
sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00  1993.00 19.67 1207.67     0.15    12.47    21.07     4.40    3.18   0.19  22.93


Reply | Threaded
Open this post in threaded view
|

Re: merge policy tuning hint

Jim Wong
After all that, I forgot the attachments.

merges-current.png (141K) Download Attachment
merges-total.png (46K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: merge policy tuning hint

Randall McRee
Hi Jim,
Nice graphs. It seems that what we are missing, though, is the indexing volume. Presumably, if you aren't adding to the indexes, merges would not be happening at all. Is the indexing more or less constant over the time period shown? Is it constant across the different indices, as well?

If it is constant across indices, that would seem to indicate that your bigger merges are simply taking longer. (Again, see the McCandless blog for why that is.)
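
One rough way to answer that question: the same node-stats polling that produced the merge graphs can be pointed at the indexing counters. A minimal sketch, assuming the 0.19-era /_cluster/nodes/stats endpoint and an indices.indexing.index_total counter (verify the exact field names on your version):

# Minimal sketch: track indexing volume the same way as the merge rate, by
# differencing a cumulative counter from node stats between samples. The
# endpoint and the indices.indexing.index_total field are assumptions for a
# 0.19-era cluster; verify the exact names on your version.
import json
import urllib.request

STATS_URL = "http://localhost:9200/_cluster/nodes/stats?indices=true"

def fetch_index_totals():
    """Return {node_name: cumulative_docs_indexed} from node stats."""
    with urllib.request.urlopen(STATS_URL) as resp:
        body = json.load(resp)
    totals = {}
    for node_id, node in body.get("nodes", {}).items():
        indexing = node.get("indices", {}).get("indexing", {})
        totals[node.get("name", node_id)] = indexing.get("index_total", 0)
    return totals

# Sample this periodically and subtract consecutive values to get docs/minute,
# exactly as with merges.total; if that rate stays flat while merge times climb,
# the slowdown is not explained by extra indexing load.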

Since you have so many indices, have you tried putting them all on different merge schedules? It seems like that would be trivial via one or more of: a) setting a different merge policy per index; b) setting different index.merge.policy.segments_per_tier values; c) optimizing segments at various times for various indices. I've set segments_per_tier anywhere from 3 to 50, depending on whether I was trying to get the best query or indexing performance, so experiment. My current working theory is that your cluster dies because shards from all 79 indices (32 shards each) end up merging at around the same time.
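
As a sketch of what (b) might look like, the index update-settings API can be used to stagger segments_per_tier across indices. Whether 0.19.x picks up merge-policy settings dynamically (rather than only when an index is opened) is something to verify first; the index names and values below are placeholders.

# Minimal sketch: stagger index.merge.policy.segments_per_tier across indices so
# they don't all hit their big merges on the same schedule. Index names and the
# chosen values are placeholders; whether 0.19.x applies merge-policy settings
# dynamically (vs. only when an index is reopened) should be verified first.
import json
import urllib.request

ES_BASE = "http://localhost:9200"
INDICES = ["i11", "i13", "i15"]      # placeholders for the real list of indices
TIER_VALUES = [5, 10, 20, 35, 50]    # spread of segments_per_tier values to rotate through

def update_segments_per_tier(index_name, segments_per_tier):
    body = json.dumps(
        {"index": {"merge": {"policy": {"segments_per_tier": segments_per_tier}}}}
    ).encode("utf-8")
    req = urllib.request.Request(
        "%s/%s/_settings" % (ES_BASE, index_name), data=body, method="PUT"
    )
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return resp.read()

for position, index_name in enumerate(INDICES):
    value = TIER_VALUES[position % len(TIER_VALUES)]
    print("setting %s -> segments_per_tier=%d" % (index_name, value))
    update_segments_per_tier(index_name, value)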


So two concrete things to try:
1) different merge schedules for all indices (should at least change that graph if this is the problem)
2) fewer shards per index (try one or two, instead of 32).

I've been poking at memory because, in my elastic system and in all of the ones I've read about on this newsgroup, it seems to be a scarce resource: the OS needs copious amounts for Lucene I/O, and elastic needs plenty to do its own thing. I ended up with twice as much memory as we had initially planned for in our cluster, for example.

Good luck,
Randy


On Wed, Dec 19, 2012 at 11:47 PM, Jim Wong <[hidden email]> wrote:
After all that, I forgot the attachments.


On Wednesday, December 19, 2012 11:47:06 PM UTC-8, Jim Wong wrote:
Sorry for the double-post, and apologies in advance if posting attachments to the group is verboten.

Here are graphs that show two values for the last week:

- merges-current.jpg shows the number of current merges as reported by ES. Note that this number stayed low after midday Friday (when we "fixed" things the last time) through sometime Sunday, when things started to go downhill.

- merges-total.jpg shows the rate of merges reported by ES, calculated by subtracting the total merges metric at repeating time intervals.

Lining these two graphs up, it looks like things are good Friday and Saturday: we're merging quite a bit, but not that many are active at any given time, which seems to indicate things are going fast.

Starting on Sunday, the number of concurrent merges starts climbing, and the total number of merges completed drops way off.  Our conclusion is that merges have started taking longer for some reason, which leads to more running at the same time.

You can disregard the changes at the end of the day Wednesday - that's when we started fixing things by restarting nodes. If the previous pattern holds, things will be OK again for a couple days.

We also have things set up with debug logging so that ES prints a message like this when a merge completes and takes more than 20 seconds:

[2012-12-15 12:59:36,520][DEBUG][index.merge.scheduler    ] [Matsu'o Tsurayaba] [i26][22] merge [_6jeg] done, took [1.3m]

Picking a node at random, here are the number of slow merges on Saturday (healthy), Monday (unhealthy, but still working) and Wednesday (going down the drain):

Saturday: 7
Monday: 15
Wednesday: 22 

Based on this information, the number of slow merges seems to be increasing, while the total number of merges completed seems to be decreasing.

Still, I/O and CPU seem reasonably healthy:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          23.84    0.00    1.90    1.19    0.00   73.07

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.67  0.00  1.67     0.00     0.01    11.20     0.00    0.60   0.20   0.03
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00     0.67  0.00  1.67     0.00     0.01    11.20     0.00    0.60   0.20   0.03
sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00  1993.00 19.67 1207.67     0.15    12.47    21.07     4.40    3.18   0.19  22.93


On Wednesday, December 19, 2012 11:20:14 PM UTC-8, Jim Wong wrote:
Thanks for the response. We'll check out that link.

Can you elaborate on where you think the bottleneck is that's leading to this problem?

Under normal circumstances (the first couple days after the cluster comes up), there don't seem to be any obvious resource constraints arising from the large number of shards: CPU utilization is very reasonable (perhaps around 30%), and I/O utilization seems reasonable, as well (under 40%). GC logging seems to indicate that we're not having issues on that front.  Everything seems fine and dandy for a couple days.

Over time, however, _something_ seems to degrade: the number of merges the system completes decreases dramatically, and the number of slow merges increases (based on the debug logs ES produces). This is an interesting period, because the cluster outwardly seems to be healthy, but we can tell from the metrics we're collecting that things are starting go downhill. CPU and disk I/O still seem fine, but for some reason there seems to be a problem with merges.

Eventually - a few days after things start to slow down - the number of threads in use skyrockets, the heap gets completely consumed, and nodes start dropping out of the cluster. This is the end stage of the disease: we've hit a tipping point and are dead.

We have some numbers (and maybe graphs) we might be able to share that illustrate what's happening. To solve the problem, we'd like to understand what resource we're low on before we jump into re-architecting the cluster, lest we run into a similar problem again.

In the meantime, we are going to try adding another 32 GB of RAM to one of our nodes while leaving the heap size unchanged to see if it changes the pattern.  I'm a little pessimistic on this front because I/O doesn't seem to be a huge bottleneck right now (based on the output of iostat), but it does seem possible that just fractionally improving merge performance through additional buffer space could help if we are very slowly falling behind.


On Wednesday, December 19, 2012 3:55:17 PM UTC-8, RKM wrote:
This would seem to be relevant:


Elastic is simply building on top of lucene. You can "tweak" the merge factor but to get good results but you will need to understand the extensive tradeoffs involved. However, its a slight investment compared to what you have done so far.

Still, looking at your system from a high-level I believe that merge factors will be a band-aid and the only real relief is to re-architect such that you have a "reasonable" number of shards per node. For example, if are really needing 24gb of jvm heap then you probably need (at least) 48gb of memory per node. Etc.

On Wed, Dec 19, 2012 at 8:30 AM, Jim Wong <[hidden email]> wrote:
I work with Art and Ethan. A little background: we didn't target a specific number of shards per server; instead we divided our user base into partitions, and broke each partition up into a fixed number of shards, with the built-in assumption that the indexes will get much, much bigger in the future.

No, no single query hits every shard.

Just to restate some of the problem, things perform fine for a few days after the cluster is started up.  Then, we notice the number of current merges (as reported by ES) start to grow until finally nodes start dropping out and everything falls apart. We are not sure whether the current merges metric is the cause of the problem or an effect.

The question we are wrestling with is, what changes between the first 2-3 days,when the number of merges seems relatively stable and the cluster is performing great, and the next several days, when the number of merges jumps up and everything falls apart.

During this time, heap usage in the JVM looks good and disk I/O looks fine - the utilization percentage reported by iostat is under 50%. CPU seems to increase with the number of active merges, but doesn't seem to max out.

Thoughts? We're actually in a period right now in which the number of merges has started to climb, but no one has dropped out of the cluster (yet). The pattern is consistent enough that we can look at the merge activity graphs and anticipate when the next failure will occur.


On Tuesday, December 18, 2012 9:24:11 PM UTC-8, Otis Gospodnetic wrote:
Hi,

I didn't carefully read this thread, but that seems like a high number of shards/server.  Does a single query hit all of them?  If so, that's a biiiig red flag.

How did you come up with "32 shards" number?  Why not 22?  Or 19?

A quick way of testing whether you have too many shards and if that's the source of the problem is to create a test cluster somewhere, take something like a Wikipedia dump, index it into an index with a high number of shards and do a performance test while observing various performance metrics (see my signature).  Then drop the index and create a new one with half the shards and repeat.  This will give you the feel for how much impact a high number of shards has on perf.

Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html


On Tuesday, December 18, 2012 6:08:34 PM UTC-5, Ethan Shrago wrote:
Sorry, just to clarify...each index has only 32 shards with one replica, but we have 250 shards per host.

Cluster in a nutshell:
Total combined index size: 23.4 TB

-19 data nodes
- 79 indices
- 32 shards + 1 replica shard = 5056 total shards in cluster
- 5056 shareds / 19 data nodes = approx. 266 shards/node (we allocate shards evenly and randomly across the cluster, so any index will have shards that span across most all of the hosts in the cluster)

For each node:

16 cores (16xIntel(R) Xeon(R) CPU L5520 @ 2.27GHz)
32GB memory (8X 4096 MB/1333 MHz/DIMM/DDR3),

Local OS has 2x250GBx7200RPM in a mirrored RAID.

Separate RAID-10 for ES index data - Seagate 15K 450GB, total logical drive size 2.7TB

Leading up the the node drop, we did receive ping responses, but since they were past the timeout threshold we have set (120 seconds), they were disregarded and the node got removed due to a failed heartbeat...this was happening for a couple of hours leading up to the node dropping, so likely what happened is that a couple of ping requests would timeout, but one would get through before it timed out:

[2012-12-10 07:14:52,411][WARN ][transport                ] [Farallah] Received response for a request that has timed out, sent [267929ms] ago, timed out [147928ms] ago, action [discovery/zen/fd/ping], node [[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id [56024100]

[2012-12-10 11:45:44,488][WARN ][transport                ] [Farallah] Received response for a request that has timed out, sent [193187ms] ago, timed out [73186ms] ago, action [discovery/zen/fd/ping], node [[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]], id [57862298]

[2012-12-10 11:51:45,528][INFO ][cluster.service          ] [Farallah] removed {[Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]],}, reason: zen-disco-node_failed([Mark Gervaisnight shade][QHjSz782RqOACkEc5IRmCg][inet[/10.5.120.88:9300]]), reason failed to ping, tried [3] times, each with maximum [2m] timeout




On Tuesday, December 18, 2012 2:23:16 PM UTC-8, RKM wrote:
Still not sure what each node looks like? Cores, memory, disc?

Can you succinctly summarize for the group what your cluster looks like?

I understand that for logical reasons you have many indices, but why does each index have 250 shards?

You saw that merging was happening around the time of the node drop--but can you tell us *why* it dropped exactly? Did it fail to respond to a heartbeat? Did the wrapper kill elasticsearch? There might be more than one problem here.

Thanks,
Randy

On Thu, Dec 13, 2012 at 12:24 PM, arta <[hidden email]> wrote:
Hi,
I think I am seeing node performance degrades when the merge count
("merges":{"current": x, ...}} in node stats) increases to somewhere 6 to
10, and that results the node drop from the cluster.

I enabled index.merge DEBUG log then I see following logs for each shard:

[2012-12-12 18:29:44,136][DEBUG][index.merge.policy       ] [Scourge of the
Underworld] [i11][25] using [tiered] merge policy with
expunge_deletes_allowed[10.0], floor_segment[2mb], max_merge_at_once[10],
max_merge_at_once_explicit[30], max_merged_segment[5gb],
segments_per_tier[10.0], reclaim_deletes_weight[2.0], async_merge[true]
[2012-12-12 18:29:44,136][DEBUG][index.merge.scheduler    ] [Scourge of the
Underworld] [i11][25] using [concurrent] merge scheduler with
max_thread_count[3]

I'm using default merge policy on each node with 16 cores.
num_replicas is 1.
Each node has around 250 shards including primary and secondary shards.
ES is 0.19.3.

This morning I encountered the node drop, and at that time, the node dropped
was doing many merges, according to the log. The time used to each merge
seems increasing when the node dropped.
It started with relatively short time like this:

[2012-12-13 07:43:07,235][DEBUG][index.merge.scheduler    ] [Scourge of the
Underworld] [i13][27] merge [_72z9] done, took [54.8s]
(the size of this shard is 4.2GB)

Then the time increased to minutes (7:50 is the time when this node dropped
from the cluster):

[2012-12-13 07:50:30,396][DEBUG][index.merge.scheduler    ] [Scourge of the
Underworld] [i15][3] merge [_16sc3] done, took [4.8m]
(the size of this shard is 5.8GB)

It kept increasing:

[2012-12-13 07:57:17,056][DEBUG][index.merge.scheduler    ] [Scourge of the
Underworld] [i531][22] merge [_6293] done, took [11m]
(the size of this shard is 3.6GB)

I'm not sure this is related to my problem (node drop), but my gut feeling
is I might need some tuning on the merge policy.

Any hints or guidance would be very much appreciated.
Thank you for advance.



--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/merge-policy-tuning-hint-tp4026976.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--



--
 
 

--
 
 

--