CPU usage increase after running for a while (CacheRecycler?)


Jérôme Gagnon
Hi everyone,

We are currently trying to put our ES cluster into production, and we are having some CPU usage issues after some uptime. When we start, everything runs fine, but after a while we see a steady increase in CPU usage.

Here is a screenshot of the CPU usage from last night while we experienced the issue (in the middle): http://dl.dropbox.com/u/317367/cpu-day.png

We added a cache expiry time and found that it helped. (We are still running with this setting, but CPU usage is still climbing.)
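
For reference, the expiry we set is the field data cache expire setting, roughly like this in elasticsearch.yml (the 10m here is an example value, not our exact setting):

index:
        cache:
                field:
                        expire: 10m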

After doing some CPU profiling, we found that CacheRecycler is one of the hot spots, specifically at:

	at org.elasticsearch.common.util.concurrent.jsr166y.LinkedTransferQueue.xfer(LinkedTransferQueue.java:606)
	at org.elasticsearch.common.util.concurrent.jsr166y.LinkedTransferQueue.add(LinkedTransferQueue.java:1049)
	at org.elasticsearch.common.CacheRecycler.pushIntArray(CacheRecycler.java:470)
	at org.elasticsearch.common.CacheRecycler.pushIntArray(CacheRecycler.java:460)

I attached the thread dumps to this post; es7b is having the issue, while es1b is not.

Thanks for any help.


Attachments: es1b.txt (185K), es7b.txt (214K)

Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
Edit: We believe this is not related to GC, since GC CPU usage is 2-5% on all the nodes and the heap stays in a clean 50-75% usage range.


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
Edit 2: We also reduced the search thread_pool size, since we think all the threads end up calling the method in the trace above, and with a blocking call there is some kind of contention; over time the CPU time spent in xfer and append keeps increasing.

We are running the latest ES version (0.20.5) with Java 7u11.


Re: CPU usage increase after running for a while (CacheRecycler?)

kimchy
Can you issue a hot threads request when you see the increased CPU usage and gist the output?
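
Something like the following will pull it (a minimal sketch using the plain JDK HTTP client; it assumes a node reachable on localhost:9200 — adjust the host/port to your setup):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HotThreadsDump {
    public static void main(String[] args) throws Exception {
        // _nodes/hot_threads samples the busiest threads on every node
        URL url = new URL("http://localhost:9200/_nodes/hot_threads");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // paste this output into a gist
            }
        }
    }
}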


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
Sure, it's happening right now as a matter of fact...

https://gist.github.com/jgagnon1/5007106


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
And for the whole cluster...

https://gist.github.com/jgagnon1/5007186


Re: CPU usage increase after running for a while (CacheRecycler?)

kimchy
This is really strange… I suggest two things: first, are you sure there is no memory pressure heap-wise? Second, I pushed an updated version of those concurrent collections to the 0.20 branch; maybe you can give it a go?


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
First, I'm pretty sure there is no heap pressure; heap usage is between 10 and 12 GB out of 15 GB total, and the GC pattern is clean.

And I'm forking it right now; I will let you know.



Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
By the way, thanks for the help, it's really appreciated!


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
Early update: contention seems to have moved from LinkedTransferQueue to ConcurrentLinkedQueue (no surprise there).

I am still seeing high CPU usage though; the cluster is still running, more updates to come.

https://gist.github.com/jgagnon1/5013915


Re: CPU usage increase after running for a while (CacheRecycler?)

kimchy
This might just be the CPU needed to compute the terms facet… It might be that the sampling done to get the hot threads keeps landing on the addition to the queue...


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
In reply to this post by Jérôme Gagnon
Cluster hot_threads gist

https://gist.github.com/jgagnon1/5013938


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
In reply to this post by kimchy
So basically it's normal that my CPU is at 100% and 25-30% of it is spent in:

	org.elasticsearch.common.CacheRecycler.pushIntArray() 23.500212 467026 ms (23.5%) 467026 ms

I'm only faceting on an int field with low cardinality (5 different possible values).


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
In reply to this post by kimchy
Most of the queries run facets on a low-cardinality int field (5-6 possible values) with a facetFilter. I am not sure it's supposed to use that much CPU.

Moreover, there still seems to be contention somewhere; all my CPUs have gone up to 100% and query time is still increasing.
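
For reference, the requests look roughly like this with the Java API (a sketch only — the index, field, and filter names here are placeholders, not our real mapping):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.facet.FacetBuilders;

public class FacetQuery {
    // Terms facet on a low-cardinality int field, narrowed by a facet filter.
    public static SearchResponse run(Client client) {
        return client.prepareSearch("items") // placeholder index name
                .setQuery(QueryBuilders.matchAllQuery())
                .addFacet(FacetBuilders.termsFacet("by_status")
                        .field("status") // low-cardinality int field
                        .size(6)         // only 5-6 distinct values expected
                        .facetFilter(FilterBuilders.termFilter("active", true))) // placeholder filter
                .execute()
                .actionGet();
    }
}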


Re: CPU usage increase after running for a while (CacheRecycler?)

kimchy
How many concurrent requests are you executing? What are the search thread pool settings? How many cores do you have?


Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
We looked at the CacheRecycler code here, and we don't see how the queue size could ever decrease, because popIntArray only ever calls poll(). So our theory is that the queue is unbounded, and threads end up looping in there when trying to add an int[]. Moreover, there seems to be a correlation between CPU usage and the number of field evictions on our servers, so that could be linked. Maybe you can help us with that?

We are also not yet sure when those methods (pushIntArray and popIntArray) are called. We added some logging to track the size of the queue, but we need to give it time before anything shows up in the log...
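
To make the theory concrete, here is a stripped-down sketch of the pattern as we read it (our simplification, not the actual ES source):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class IntArrayRecycler {
    // Nothing ever bounds the number of parked arrays: push() only adds and
    // pop() only polls, so under churn the queue (and the time spent
    // walking/CAS-ing its nodes) can keep growing.
    private final Queue<int[]> pool = new ConcurrentLinkedQueue<int[]>();

    public int[] pop(int size) {
        int[] arr = pool.poll();      // removes at most one entry
        if (arr == null || arr.length < size) {
            return new int[size];     // cache miss: allocate a fresh array
        }
        return arr;
    }

    public void push(int[] arr) {
        pool.add(arr);                // unbounded add: the pool only grows
    }
}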

For the information you asked:

Bigdesk shows ~60 QPS.
Thread pool settings:
threadpool:
        search:
                type: fixed
                size: 16
                min: 1
                queue_size: 64
                reject_policy: abort

Our machines have 8 cores (2 quad-core CPUs).

Thank you,

Jerome

On Friday, February 22, 2013 11:02:37 AM UTC-5, kimchy wrote:
How many concurrent requests are you executing? What is the search thread pool sestinas? How many cores do you have?

Reply | Threaded
Open this post in threaded view
|

Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
We took another look at the CacheRecycler code and confirmed that poll() does remove the head.

Since our profiling now points to offer() for CPU time, we took a look at the ConcurrentLinkedQueue code and noticed that offer() tries to append at the tail, and if that fails (the tail was changed by another thread) it scans the list to find the new tail. So our conclusion is that either the queue is very large, or concurrent access is so heavy that offer() keeps rescanning the queue. If the second hypothesis (concurrent access) were the problem, we would see the high CPU as soon as we start the server, which is not the case. So now we think the number of pushes is greater than the number of pops, which would cause the queue to grow over time. Could that be the case? As a workaround, we are thinking about capping the queue size (sketched below).

What do you think?
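A hedged sketch of that capping workaround (the class name, the cap value, and the counter approach are illustrative assumptions, not ES code): bound the recycle queue with an explicit atomic counter so push() drops arrays once the cap is reached, without ever calling the O(n) size() method.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class CappedIntArrayRecycler {
    private static final int MAX_CACHED = 1024;  // assumed cap, tune as needed
    private final Queue<int[]> queue = new ConcurrentLinkedQueue<int[]>();
    private final AtomicInteger cached = new AtomicInteger();

    int[] pop(int size) {
        int[] a = queue.poll();
        if (a != null) {
            cached.decrementAndGet();
            return a;
        }
        return new int[size];
    }

    void push(int[] a) {
        // Over the cap: let the array go to GC instead of growing the queue.
        if (cached.incrementAndGet() <= MAX_CACHED) {
            queue.offer(a);
        } else {
            cached.decrementAndGet();
        }
    }
}

The counter may transiently overshoot the cap under contention, but it keeps the queue size bounded without any global lock.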

Reply | Threaded
Open this post in threaded view
|

Re: CPU usage increase after running for a while (CacheRecycler?)

joergprante@gmail.com
I think your analysis is correct, in the sense that ConcurrentLinkedQueue is a beast. Note that its size() method is not constant-time: it must iterate through the whole list. In my tests I observed edge situations where, when queue elements are created and added faster than slow consumer threads can drain them, the CPU usage for size() rises more than linearly. I'm not sure this alone will eat the CPU, but there is a price to pay for an unbounded concurrent queue. Another option would be a bounded concurrent queue like ArrayBlockingQueue (which has its own shortcomings: the capacity cannot be changed). I must confess I found it much harder to program a bounded concurrent queue than an unbounded one (what happens if offer/poll time out? when should they time out?)

Jörg
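One way to sidestep the offer/poll timeout questions raised above (an illustrative sketch, not a tested fix; the class name and capacity are assumptions): use ArrayBlockingQueue only through its non-blocking methods, so nothing ever blocks or times out. offer() simply returns false when full (the array is dropped) and poll() returns null when empty.

import java.util.concurrent.ArrayBlockingQueue;

class BoundedIntArrayRecycler {
    // Capacity is fixed at construction time (the shortcoming noted above).
    private final ArrayBlockingQueue<int[]> queue = new ArrayBlockingQueue<int[]>(1024);

    int[] pop(int size) {
        int[] a = queue.poll();   // non-blocking: null if the queue is empty
        return a != null ? a : new int[size];
    }

    void push(int[] a) {
        queue.offer(a);           // non-blocking: returns false (array dropped) if full
    }
}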

Reply | Threaded
Open this post in threaded view
|

Re: CPU usage increase after running for a while (CacheRecycler?)

kimchy
Administrator
In reply to this post by Jérôme Gagnon
The number of pushes can't really be greater than the number of pops, assuming you have a bounded number of concurrent requests (like a bounded search thread pool). Both ConcurrentLinkedQueue and LinkedTransferQueue are non-blocking, which means they retry on "failure" (à la compareAndSet). Note that Jörg's point about size() is not relevant here: we don't call size().

What you see is really strange… Can you try a newer Java version? If you still suspect ConcurrentLinkedQueue, you can try using a LinkedBlockingQueue instead and see if it helps.

Do you see the high CPU usage while driving the concurrent search load? It might simply be normal to see high CPU usage, with nothing to do with the queue itself: 16 concurrent shard-level queries are allowed to execute, and if you push enough load to keep the pool full, I suspect you will see very high CPU usage (facets are typically very CPU intensive).
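A small self-contained harness (a diagnostic sketch, not part of ES; thread counts and array size are arbitrary) to test both suggestions at once: swap ConcurrentLinkedQueue for LinkedBlockingQueue via a flag, drive push-heavy load, and log whether the queue actually grows.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueGrowthCheck {
    public static void main(String[] args) throws InterruptedException {
        final Queue<int[]> q = args.length > 0 && args[0].equals("blocking")
                ? new LinkedBlockingQueue<int[]>()
                : new ConcurrentLinkedQueue<int[]>();

        // 16 writers mimic the fixed search pool; fewer readers make
        // pushes outpace pops, the suspected failure mode.
        for (int i = 0; i < 16; i++) {
            new Thread(new Runnable() {
                public void run() { while (true) q.offer(new int[128]); }
            }).start();
        }
        for (int i = 0; i < 4; i++) {
            new Thread(new Runnable() {
                public void run() { while (true) q.poll(); }
            }).start();
        }

        while (true) {
            Thread.sleep(5000);
            // size() is O(n) on ConcurrentLinkedQueue (O(1) on
            // LinkedBlockingQueue); acceptable for a diagnostic.
            System.out.println("queue size: " + q.size());
        }
    }
}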

Reply | Threaded
Open this post in threaded view
|

Re: CPU usage increase after running for a while (CacheRecycler?)

Jérôme Gagnon
So, would reducing the number of threads help? There are ~3 shards per node, so if the thread pool size applies per Lucene shard instance, does that mean 48 search threads?

I know that facets can be CPU heavy, but I'm not running a heavy number of queries per second either. I'm not sure what to try next.

For the record, I am using Java 7u11, which is pretty much the latest.
