High CPU Load on only some of the machines in a cluster


High CPU Load on only some of the machines in a cluster

Chris.Rode
Hi all

I have not been able to find anything to explain why this may be occurring.

We have a cluster of 7 nodes on ephemeral EC2 with an S3 gateway set to 10 minutes. When we put the cluster under load, though, only 3 of the 7 nodes get any load on them, and that quickly builds up to a point where queries are timing out.

We originally had a replication factor of 3 set, and have just updated it to 6 (shard count is 20 per index); however, this does not seem to have done the trick.

Any ideas?



Re: High CPU Load on only some of the machines in a cluster

kimchy
Administrator
Do you have multiple indices in the cluster? If so, maybe you end up with certain indices that live only on a subset of the cluster?


Re: High CPU Load on only some of the machines in a cluster

Chris.Rode
Hey Shay

Yes we do, we have 3, each with 20 shards. Looking with elasticsearch-head I can see that each index is not spread evenly. Trying to use index.routing.allocation.total_shards_per_node (set to 15) transiently now...

Anything else I should try?


Re: High CPU Load on only some of the machines in a cluster

Chris.Rode
UPDATE: Unfortunately applying this transiently balked; however, I have applied it manually to each node in the cluster. After the dust settled, the shards still aren't distributed fairly.
We have set the setting to 13 (20 shards * 4 copies (1 original + 3 replicas) / 7 nodes rounds up to 12, so this should be fine) and some nodes still have 20 shards of a single index.

We are using 0.18.6; is this a 0.18.7-only setting by any chance?
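
The arithmetic above can be checked with a quick ceiling division; this is plain shell, nothing Elasticsearch-specific, using the numbers from this thread:

```shell
shards=20   # shards per index
copies=4    # 1 primary + 3 replicas
nodes=7

total=$((shards * copies))                  # 80 shard copies in all
# Ceiling division gives the smallest per-node cap that still fits.
per_node=$(( (total + nodes - 1) / nodes ))
echo "$per_node"
```

So a cap of 12 already suffices; setting it to 13 leaves a little slack.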


Re: High CPU Load on only some of the machines in a cluster

jacque74
The thing to do is to take a jstack of the elasticsearch process while it is under load (use pastebin and post the link here). From that we will be able to tell which of these is happening. So far I have seen at least a few major overload conditions:

1) The index is cold and you are doing bulk inserts, so the bloom filter load kills your disk IO. (Shay, is there a way to pace this somehow? The IO wait and waiting threads go to 500-1000.)
2) You are doing massive merges or flushes; potentially your refresh is not set to -1 while you are indexing.
3) Your disks are too slow; check an iostat -xdm 1 report for 30 seconds.
4) You are running out of heap memory (check the log for OOME messages).

-Jack
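
The checks above might look roughly like this from a shell on one of the nodes; the process-lookup pattern and the log path are assumptions and will vary by install:

```shell
# 1) Thread dump of the Elasticsearch JVM while it is under load
#    (the pgrep pattern is an assumption; adjust for your install).
ES_PID=$(pgrep -f elasticsearch | head -n 1)
jstack "$ES_PID" > /tmp/es-jstack.txt

# 3) Extended disk stats, one sample per second for 30 seconds.
iostat -xdm 1 30 > /tmp/es-iostat.txt

# 4) Look for OutOfMemoryError messages in the node log
#    (log path is an assumption).
grep -i "OutOfMemoryError" /var/log/elasticsearch/*.log
```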


Re: High CPU Load on only some of the machines in a cluster

Chris.Rode
Thanks for your offer to help, Jack.

http://pastebin.com/ZFHJzX9Y

That is the jstack trace you asked for. We now have 4 nodes on m1.xlarge with an mdadm RAID 0 across the 4 drives.

We are doing indexing while we run the load, as well as straight queries. We haven't turned off the refresh, but we have turned it down to 10 seconds:

gateway.index_refresh: 10

I have also taken a Skitch screenshot of the cluster allocation tables in elasticsearch-head:

https://skitch.com/christopherrode/8rcdc/elasticsearch-head

Any help you can give would be much appreciated

Cheers


Re: High CPU Load on only some of the machines in a cluster

kimchy
Administrator
In reply to this post by Chris.Rode
The setting only applies in 0.19. But you did not answer the question with a bit more detail: how many indices, how many shards, and are they evenly balanced?


Re: High CPU Load on only some of the machines in a cluster

Chris.Rode
Sorry Shay

The image attached via Skitch shows the cluster and shard distribution at the time of the jstack trace.

We have 4 nodes, with 3 indexes and a replication factor of 3. Each index has 20 shards. These are running across two availability zones on Amazon.

When we turn down the replication factor the nodes get an even number of shards overall; however, they do not get an even number of shards per index. This means that indexes with higher load do not utilise the entire cluster's resources.

We are currently using 0.18.7




Re: High CPU Load on only some of the machines in a cluster

Chris.Rode
This new Skitch screenshot illustrates what I was saying about the index-node balancing behaviour:

https://skitch.com/christopherrode/8r9xr/elasticsearch-head




Re: High CPU Load on only some of the machines in a cluster

Berkay Mollamustafaoglu
Chris,

ES does not distribute shards to achieve an even number of shards per index; it distributes them to achieve an even number of shards per node in total. Currently in ES, when you have indices with drastically different sizes, you can end up with unbalanced load.
Why do you have a separate index with 20 shards for the index with 106 documents? Is it supposed to grow? The easiest way to resolve the problem is to have a single index with different types for the 2nd and 3rd indices, rather than separate indices.
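
What this describes, one index with multiple types instead of several small indices, might be sketched like this (host, index and type names are placeholders):

```shell
# Documents that previously went to separate small indices become
# different types under one index (all names are placeholders).
curl -XPUT 'http://localhost:9200/combined/products/1' -d '{"name": "widget"}'
curl -XPUT 'http://localhost:9200/combined/stores/1' -d '{"city": "Sydney"}'
```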

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype



Re: High CPU Load on only some of the machines in a cluster

kimchy
Administrator
As Berkay mentioned, the balancing is done across indices, though I have started to work on a smarter allocator that will handle the cross-indices case better. In 0.19 I added a band-aid (for now) where you can force, at the index level, a max number of shards per node, so you can force even balancing (take total_number_of_shards / number_of_nodes, where total_number_of_shards is (1 + number_of_replicas) * number_of_shards).
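
Applied to one index with 20 shards, 3 replicas and 7 nodes, the formula gives (1 + 3) * 20 / 7, which rounds up to 12. A sketch of setting that cap on a 0.19 cluster follows; the host and the index name myindex are placeholders:

```shell
# Cap shards of this index at 12 per node, per the formula above
# (host and index name are placeholders).
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index.routing.allocation.total_shards_per_node": 12
}'
```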


Re: High CPU Load on only some of the machines in a cluster

Nick Dimiduk
Hi,

I am having a similar experience. In my case, one node is pegged while the others sit at only 25-50% CPU utilisation. That over-burdened node is hosting 4 index shards while the others have 3; it is not the master. None of the nodes are swapping.

Configuration.

Running v0.18.7. I have only a single index: 5 shards, 1 replica. ES Head reports it as ~1 million documents and ~1.8g in size. The index is configured to be held entirely in memory with the S3 gateway. ES_MAX_MEM is set to 4g. I'm running on a 3-node cluster of m1.large instances.

Workload.

I started with a bulk insert of all the data. After that, a couple of schema tweaks resulted in additional loads of a portion of the data set. The query being run is "match_all" with a "geo_distance" filter applied. At this point no other fields are being filtered. No writes are happening during this read test.

Why is this single machine the bottleneck? How can I go about distributing this load more evenly? Is there a minimum cluster size or specific cluster configuration I should be using for this kind of load? I'm curious how I can even go about diagnosing the issue.

Thanks,
Nick


Re: High CPU Load on only some of the machines in a cluster

Chris.Rode
In reply to this post by kimchy
Thanks for that, Shay.

(I was wondering why that setting wasn't taking effect on 0.18.7 :)).
That setting will help distribute the shards in the multi-index case.
For now we have simply combined our indexes. We are now down to 2 and
the shards are more evenly distributed due to that change.

We ended up getting the cluster to cope with the volume of reads by
switching on caching for certain filters. So far this is working for
us, although before the cache warms we still see the problem that Nick
mentions above, where one or two machines get high load and the rest
hardly show any; this disappears as the cache gets populated.
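
For reference, switching on caching for a filter might look like this in a query of the kind discussed in this thread; the index name, field name and coordinates are placeholders, and whether a given filter honours _cache depends on the filter and version:

```shell
# match_all query with a geo_distance filter whose result is cached
# (index, field and coordinates are placeholders).
curl -XPOST 'http://localhost:9200/myindex/_search' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "geo_distance": {
          "distance": "10km",
          "location": { "lat": 40.7, "lon": -74.0 },
          "_cache": true
        }
      }
    }
  }
}'
```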

On Feb 28, 12:09 am, Shay Banon <[hidden email]> wrote:

> As Berkay mentioned, the balancing theme is across indices, though I have started to work on smarter one that will do better cross indices case. In 0.19, I added a bandaid (for now) where you can force, on the index level, max number of shards per node, so you can force even balancing (take total_number_of_shards / number_of_nodes, where total_number_of_shards is (1+number_of_replicas) * number_of_shards).
>
>
>
>
>
>
>
> On Monday, February 27, 2012 at 5:12 AM, Berkay Mollamustafaoglu wrote:
> > Chris,
>
> > ES does not distribute shards to accomplish " even number of shards per index". It distributes to have even number of shards per node in total. Currently in ES, when you have indices with drastically different sizes, you end up with unbalanced load.
> > Why do you have a separate index with 20 shards for the index with 106 documents, is it supposed to grow? The easiest way to resolve the problem is to have only a single index with different types for 2nd and 3rd index, rather than having separate indices.
>
> > Regards,
> > Berkay Mollamustafaoglu
> > mberkay on yahoo, google and skype
>
> > On Sun, Feb 26, 2012 at 7:36 PM, Chris Rode <[hidden email] (mailto:[hidden email])> wrote:
> > > This new skitch illustrates what I was saying about the index-node
> > > balancing behavior
>
> > >https://skitch.com/christopherrode/8r9xr/elasticsearch-head
>
> > > On Feb 27, 10:25 am, Chris Rode <[hidden email] (mailto:[hidden email])> wrote:
> > > > Sorry Shay
>
> > > > The image attached via skitch shows the cluster and distribution at
> > > > the time of the Jstack trace
>
> > > > We have 4 nodes, with 3 indexes and a replication factor of 3. Each
> > > > index has 20 shards. These are running across two availability zones
> > > > on amazon.
>
> > > > When we turn down the replication factor the nodes get an even number
> > > > of shards, however they do not get an even number of shards per index.
> > > > This means that indexes with higher load do not utilise the entire
> > > > clusters resources
>
> > > > We are currently using 0.18.7
>
> > > > On Feb 27, 6:22 am, Shay Banon <[hidden email] (mailto:[hidden email])> wrote:
>
> > > > > The setting only applies in 0.19, but you did not answer the question with a bit more detailed information (how many indices, how many shards, are they not evenly balanced)?
>
> > > > > On Wednesday, February 22, 2012 at 1:32 AM, Chris Rode wrote:
> > > > > > UPDATE: Unfortunately applying this transiently baulked.. however I
> > > > > > have applied manually to each node in the cluster. After the dust
> > > > > > settled, the shards still aren't distributed fairly.
> > > > > > We have set the setting to 13 (20(shards)*4(original+copies)/7(nodes)
> > > > > > = 12 so this should be fine) and some nodes still have 20 shards on an
> > > > > > index.
>
> > > > > > We are using 18.6 is this an 18.7 only setting by any chance?
>
> > > > > > On Feb 21, 5:16 pm, Chris Rode <[hidden email] (mailto:[hidden email])> wrote:
> > > > > > > Hey Shay
>
> > > > > > > Yes we do, we have 3, each with 20 shards. Looking with elasticsearch-
> > > > > > > head I can see that each index is not spread evenly. Trying to use
> > > > > > > index.routing.allocation.total_shards_per_node (set to 15) transiently
> > > > > > > now...
>
> > > > > > > Anything else I should try?
>
> > > > > > > On Feb 21, 12:12 am, Shay Banon <[hidden email] (mailto:[hidden email])> wrote:
>
> > > > > > > > Do you have multiple indices in the cluster? If so, maybe you end up with certain indices that only on a subset of the cluster?
>
> > > > > > > > On Monday, February 20, 2012 at 6:01 AM, Chris Rode wrote:
> > > > > > > > > Hi all
>
> > > > > > > > > I have not been able to find anything to explain why this may be occurring..
>
> > > > > > > > > We have a cluster of 7 nodes on ephemeral EC2 with an S3 gateway set to 10 minutes. When we put the cluster under load though, only 3 of the 7 nodes get any load on them, and that quickly builds up to a point where queries are timing out.
>
> > > > > > > > > We originally had a replication factor of 3 set, and have just updated it to 6 (shard count is 20 per index), however this does not seem to have done the trick
>
> > > > > > > > > Any ideas?

Re: High CPU Load on only some of the machines in a cluster

Frederic
In reply to this post by kimchy
Hi there,

For what it's worth, I have the same issue here. We upgraded our cluster to 0.18.7 (6 servers, 1 index, 20 shards, 1 replica) two days ago and set both min and max heap to 4GB. We then started indexing 50GB of docs at about 150 docs/sec, and one of the servers showed very high load (between 20 and 40) with CPU usage at 200% on a 4-core CPU. After restarting that server, another server showed similar behavior.

Thanks,
Frederic
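For reference, 0.18-era tarball installs commonly pinned the heap through environment variables read by the startup script (bin/elasticsearch.in.sh). A sketch mirroring Frederic's 4GB min/max; the variable names are from that era's script and may differ in other installs:

```shell
# Pin the JVM heap before starting the node; setting min == max avoids
# heap resizing pauses under load.
export ES_MIN_MEM=4g
export ES_MAX_MEM=4g
echo "heap: min=$ES_MIN_MEM max=$ES_MAX_MEM"
# bin/elasticsearch would then pick these up on startup.
```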


On Monday, 27 February 2012 10:09:39 UTC-3, kimchy wrote:
As Berkay mentioned, the balancing theme is across indices, though I have started to work on smarter one that will do better cross indices case. In 0.19, I added a bandaid (for now) where you can force, on the index level, max number of shards per node, so you can force even balancing (take total_number_of_shards / number_of_nodes, where total_number_of_shards is (1+number_of_replicas) * number_of_shards).

On Monday, February 27, 2012 at 5:12 AM, Berkay Mollamustafaoglu wrote: