Node experiencing relatively high CPU usage


Nitish Sharma
Hi,
We have a 5-node ES cluster. On one particular node, the ES process consumes 600-700% CPU (of 8 cores) all the time, while the other nodes' CPU usage stays below 100%. We are running 0.19.8 and each node has an equal number of shards.
Any suggestions?

Cheers
Nitish

Re: Node experiencing relatively high CPU usage

Igor Motov-3
Run jstack on the node that is using 600-700% of CPU and let's see what it's doing.
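(For reference, a minimal way to capture such a dump, assuming the node runs under an "elasticsearch" user and the JDK tools are on the PATH:

sudo -u elasticsearch jps -l                              # list running JVMs and their main classes to find the ES PID
sudo -u elasticsearch jstack <es-pid> > es-node.jstack.txt

Running jstack as the same user that owns the JVM avoids attach-permission errors.)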


Re: Node experiencing relatively high CPU usage

Nitish Sharma
Hi Igor,
I couldn't make any sense of the jstack dump (about 2000 lines long). Maybe you can help? http://pastebin.com/u57QB7ra

Cheers
Nitish
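(A dump that size is easier to skim after summarising it; a quick sketch, assuming the dump was saved to es-node.jstack.txt:

grep -c '^"' es-node.jstack.txt                                                  # total number of threads in the dump
grep 'java.lang.Thread.State' es-node.jstack.txt | sort | uniq -c | sort -rn     # threads grouped by state

A large number of RUNNABLE threads whose names mention index, merge, or search pools usually points at where the CPU time is going.)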


Re: Node experiencing relatively high CPU usage

Igor Motov-3
It looks like this node is quite busy updating documents. Is it possible that your indexing load is concentrated on the shards that just happened to be located on this particular node? 



Re: Node experiencing relatively high CPU usage

Nitish Sharma
We are, indeed, running a lot of "update" operations continuously but they are not routed to specific shards. The document to be updated can be present on any of the shards (on any of the nodes). And, as I mentioned, all shards are uniformly distributed across nodes. 


Re: Node experiencing relatively high CPU usage

Igor Motov-3
Interesting. Did you run curl "localhost:9200/_nodes/stats?pretty=true" to make sure that indexing operations really are uniformly distributed?
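(To compare nodes at a glance, the indexing counters can be pulled out of that response; a sketch, and the exact field names may differ slightly between ES versions:

curl -s "localhost:9200/_nodes/stats?pretty=true" > nodes-stats.json
grep '"name"' nodes-stats.json                 # node names, in the same order as the stats blocks
grep -A 3 '"indexing"' nodes-stats.json        # per-node indexing counters, with a few lines of context after each match

If one node's index_total grows much faster than the others between two snapshots, the indexing load is not balanced after all.)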


Re: Node experiencing relatively high CPU usage

Nitish Sharma
Hi Igor,
I checked the stats, and elasticsearch-head also confirmed that each node has an equal number of shards. Interestingly, over the weekend this behaviour (constant high CPU usage) moved to another node, and the node that was previously over-using CPU is now more or less *normal*. So, as far as I can observe, at any given point in time at least one node is doing *a lot* of pure CPU work while the other nodes are fairly quiet. Weird!
We are not indexing documents with custom routing, nor updating them with routes.
Any other pointers?

Cheers
Nitish
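(One way to double-check where the shard copies of a busy index actually live on a 0.19-era cluster, since the newer _cat APIs are not available; a rough sketch:

curl -s "localhost:9200/_cluster/state?pretty=true" > cluster-state.json
grep '"node"' cluster-state.json | sort | uniq -c     # rough count of shard copies per node id; ids map to names via /_nodes

Even with shard counts spread evenly, the primaries of a hot index can end up clustered on one node, which would concentrate the update work there.)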


Re: Node experiencing relatively high CPU usage

Stéphane Raux
Hi,

What kind of clients are you using? Do they balance their queries across the five nodes, or do they always query the same one? If it's the latter, that could explain this kind of behaviour.

Best,

Stéphane


Re: Node experiencing relatively high CPU usage

Nitish Sharma
We are using the Tire Ruby client. The ES cluster is behind HAProxy, so all search, get, and update requests are (almost) equally distributed across the nodes.


Re: Node experiencing relatively high CPU usage

kimchy
Administrator
Can you jstack another node? Let's see if it's doing any work as well. Which ES version are you using? Also, which JVM and OS versions, and are you running in a virtual environment or not?
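(The quickest way to collect those details on each node; a sketch, noting that the ES root endpoint reports its own version:

java -version                             # JVM vendor and build
lsb_release -a                            # Ubuntu release
curl -s "localhost:9200/?pretty=true"     # ES version number in the response

That covers the ES, JVM, and OS versions; whether the machines are virtualised still has to be answered by hand.)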



Re: Node experiencing relatively high CPU usage

Nitish Sharma
Hi Kimchy,
Following are the jstacks:
ES Node1 - CPU Usage 100-200%: https://gist.github.com/3216175
ES Node2 (offending node) - CPU Usage 600-700%: https://gist.github.com/3216198
ES Node3 - CPU Usage 100-200%: https://gist.github.com/3216200

ES Version: 0.19.8
OS: Ubuntu 10.04.4 LTS
JVM Versions: 20.0-b12 and 19.0-b09
All nodes are physical machines with 24 GB RAM, 8-core CPU.

Another observation: Bigdesk shows no GET requests coming to node1, which is odd since HAProxy balances all requests in round-robin fashion.
The problem is not just node2's CPU usage but also its heap usage. Because heap fills so quickly, GCs run so often that node2 heavily skews our search performance.
Following are the heap graphs:
Node1: http://imgur.com/x1bpy
Node2: http://imgur.com/sNhBQ
Node3: http://imgur.com/izrNA
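(A lightweight way to watch that GC churn live, alongside the Bigdesk graphs; a sketch, with the PID as a placeholder:

jstat -gcutil <es-pid> 5000     # heap-generation occupancy plus GC counts/times, printed every 5 seconds

A young-GC (YGC) or full-GC (FGC) count that climbs on almost every sample while the old generation (O) sits near 100% confirms the node is collecting nearly constantly.)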


Re: Node experiencing relatively high CPU usage

kimchy
Administrator
Can you try upgrading to a newer JVM? The ones you are using are pretty old. If you want to stay on 1.6, make sure it's a recent update (like update 33), and make sure you run the same JVM across all nodes.

Also, if you are up for it, a newer Ubuntu version (there is a new LTS as well) is recommended.
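(A quick consistency check across the cluster; a sketch, with the host names as placeholders:

for h in es-node1 es-node2 es-node3 es-node4 es-node5; do
  echo "== $h"; ssh "$h" 'java -version 2>&1 | head -1'
done

Every node should report exactly the same build string before drawing conclusions from the comparison.)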




Re: Node experiencing relatively high CPU usage

Nitish Sharma
Hi Kimchy,
I just updated all nodes to use Java 7; though still using Ubuntu 10.04.
Java version: 1.7.0_05
JVM: 23.1-b03

The problems still persist:
 - One of the nodes is still using a *lot* of CPU and garbage-collecting its heap almost every minute.
 - Bigdesk shows that one node is not receiving any GET requests (we have continuous update operations going on).

Any more suggestions? :/
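(If it helps to see exactly what the collector is doing on that node, GC logging can be switched on at startup; a sketch, and whether the options go through ES_JAVA_OPTS or straight into the service script depends on how the node is launched:

export ES_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/elasticsearch/gc.log"

The resulting log shows how much each collection actually reclaims, which distinguishes a heap that is simply too small from one being filled by a single hot workload.)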
 




Re: Node experiencing relatively high CPU usage

Nitish Sharma
Hi Kimchy, Igor
Given the high load on that one particular node, we added 5 more nodes to the cluster, expecting the load to spread out, and we also stopped all update operations. After running stably for about a week, today 2 out of the 10 nodes suddenly started acting up. They have exceptionally high IO wait times and thus high load, which in turn increases query execution times. Note that we are no longer doing any update operations; only simple indexing.

Jstack of a node with normal load: http://pastebin.com/vYmE8dZe
Jstack of the node with high IO load: http://pastebin.com/5xUeBqZU
Jstack of another node with high IO load: http://pastebin.com/Rafi3Fbk

All nodes run Java 7, Ubuntu 12.04 and JVM 23.1-b03. 
It would be great to get some pointers for tracking down the problem.

Cheers
Nitish
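(To pin down which device and which threads are behind the IO wait on those two nodes; a sketch, with both tools coming from the standard Ubuntu sysstat and iotop packages:

iostat -x 5        # per-device utilisation and await times, refreshed every 5 seconds
sudo iotop -o      # only the processes/threads currently doing IO

If the ES data disk sits near 100% utilisation while merge threads dominate iotop, the load is coming from Lucene segment merges rather than from queries.)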




Re: Node experiencing relatively high CPU usage

Nitish Sharma
More information - On these 2 particular nodes, we continuously get these warnings:

[2012-08-16 18:48:05,313][WARN ][index.merge.scheduler    ] [node5] [rolling_index][3] failed to merge
java.io.EOFException: read past EOF: NIOFSIndexInput(path="/var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index/_25zy9.fdt")
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:155)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:111)
        at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:132)
        at org.elasticsearch.index.store.Store$StoreIndexOutput.copyBytes(Store.java:661)
        at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:228)
        at org.apache.lucene.index.SegmentMerger.copyFieldsWithDeletions(SegmentMerger.java:266)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:223)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:107)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4256)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3901)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
        at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:91)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

These warnings relate to shard number 3, and both copies of that shard reside on these 2 nodes. Could the abnormal IO load on these 2 nodes be caused by corruption of shard 3?

Cheers
Nitish
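(One way to confirm whether that shard's Lucene index really is corrupt is Lucene's own CheckIndex tool, run read-only against the shard directory from the log message; a sketch, to be run with the node stopped, and the jar path is a placeholder for wherever the lucene-core jar from the ES lib directory lives:

java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
  org.apache.lucene.index.CheckIndex \
  /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index

Without -fix this only reports broken segments and does not modify anything.)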



Re: Node experiencing relatively high CPU usage

Igor Motov-3
Yes, I can see how constantly trying to merge segments and failing at it can cause abnormal I/O load. Has this cluster ever run out of disk space or memory while it was indexing? 
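(Two quick checks for exactly that history, run on the affected nodes; a sketch, with the syslog path being the Ubuntu default:

df -h /var/lib/elasticsearch                  # current free space on the data volume
grep -i "out of memory" /var/log/syslog*      # traces of the kernel OOM killer
dmesg | grep -i "killed process"              # same, from the kernel ring buffer

Evidence of either would explain how a partially written segment ended up on disk.)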




Re: Node experiencing relatively high CPU usage

Nitish Sharma
Yeah, at some point a couple of nodes ran out of memory. We recovered them by completely stopping the offending application.
Is there any way to recover from this segment merge failure? This particular "_25zy9" segment seems to be the only failed one. Is there a way to start the shard fresh, even if it means losing the data in that segment?

Cheers
Nitish
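(The thread does not settle this, but one commonly used last resort is CheckIndex with -fix, which drops the segments it cannot read, i.e. exactly the "lose the data in that segment" trade-off described above; a sketch, to be run only with the node stopped and after copying the shard directory somewhere safe, with the jar path again a placeholder:

cp -a /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3 /tmp/rolling_index_3.bak   # backup first
java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
  org.apache.lucene.index.CheckIndex \
  /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index -fix

If the other copy of the shard turns out to be healthy, letting ES rebuild this one from that replica is the safer route.)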



--
 
 
Reply | Threaded
Open this post in threaded view
|

Re: Node experiencing relatively high CPU usage

Nitish Sharma
Anyone got an idea about how to recover the shard?

On Friday, August 17, 2012 1:06:00 PM UTC+2, Nitish Sharma wrote:
Yeah, at some point a couple of nodes ran out of memory. We recovered them by completely stopping the offending application.
Is there any way to recover from this segment merge failure? This particular "_25fy9" segment seems to be the only failed one. Is there a way to start this shard fresh, even if it means losing the data in this particular segment?

Cheers
Nitish

On Friday, August 17, 2012 4:15:41 AM UTC+2, Igor Motov wrote:
Yes, I can see how constantly trying to merge segments and failing at it can cause abnormal I/O load. Has this cluster ever run out of disk space or memory while it was indexing? 
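(A quick way to check both on each node, for what it's worth; the data path is the one that appears in the merge warnings below and may differ per install, the rest is standard JDK/Linux tooling:)

# Disk headroom on the ES data path (path taken from the merge warnings below; adjust to your layout)
df -h /var/lib/elasticsearch
# Heap occupancy and GC activity of the ES JVM, sampled every 5 seconds
jstat -gcutil "$(pgrep -f elasticsearch)" 5s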

On Thursday, August 16, 2012 12:51:40 PM UTC-4, Nitish Sharma wrote:
More information: on these 2 particular nodes, we continuously get the following warnings:

[2012-08-16 18:48:05,313][WARN ][index.merge.scheduler    ] [node5] [rolling_index][3] failed to merge
java.io.EOFException: read past EOF: NIOFSIndexInput(path="/var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index/_25zy9.fdt")
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:155)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:111)
    at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:132)
    at org.elasticsearch.index.store.Store$StoreIndexOutput.copyBytes(Store.java:661)
    at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:228)
    at org.apache.lucene.index.SegmentMerger.copyFieldsWithDeletions(SegmentMerger.java:266)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:223)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:107)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4256)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3901)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:91)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)

These warnings relate to shard number 3, and both of its copies reside on these 2 nodes. Could it be that the abnormal IO load on these 2 nodes is caused by corruption of shard 3?
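One way to double-check which nodes hold the copies of [rolling_index][3] is the cluster state API (a sketch; the grep just pulls the surrounding allocation entries out of the pretty-printed routing table, and the exact field layout can vary between versions):

# Every copy of every shard is listed in the routing table with the node it is allocated to
curl -s "localhost:9200/_cluster/state?pretty=true" > /tmp/cluster-state.json
# Show the allocation entries for shard 3 (this also matches shard 3 of other indices; check the "index" field)
grep -n -B 6 -A 2 '"shard" : 3' /tmp/cluster-state.json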

Cheers
Nitish
Reply | Threaded
Open this post in threaded view
|

Re: Node experiencing relatively high CPU usage

Igor Motov-3
I would try to shut down ES, back up all the files in the shard index directory, and run the Lucene CheckIndex tool there. I've never had to run it on elasticsearch indices, but since they are Lucene indices, it might just work.
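For anyone who ends up trying that, here is a rough sketch of the invocation, using the shard path from the merge warnings above; the lucene-core jar location is an assumption (it ships in elasticsearch's lib directory, wherever that is on your install), and note that -fix permanently drops whatever CheckIndex cannot recover:

# Stop elasticsearch on the node, then back up the shard's index directory first
cp -a /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index /root/shard3-index-backup
# Dry run: report corrupt segments without modifying anything
java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar org.apache.lucene.index.CheckIndex \
    /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index
# Only if the report looks sane: rewrite the index, dropping the broken segment(s) and their documents
java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar org.apache.lucene.index.CheckIndex \
    /var/lib/elasticsearch/skl-elasticsearch/nodes/0/indices/rolling_index/3/index -fix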

Reply | Threaded
Open this post in threaded view
|

Re: Node experiencing relatively high CPU usage

Sebastian Lehn
Has anybody tried a repair like the one Igor advised?

On Tuesday, August 21, 2012 7:04:19 PM UTC+2, Igor Motov wrote:
I would try to shutdown es, backup all files in the shard index directory and run Lucene CheckIndex tool there. I never had to run it on elasticsearch indices, but since they are Lucene indices, it might just work.
