ElasticSearch server lock up

ElasticSearch server lock up

John Bartak
We had an interesting problem today while running a three-data-node ElasticSearch 0.90.7 cluster inside Amazon (using m1.xlarge instances).  One of the data nodes went to 100% CPU and stayed there, and attempts to use the Java API to talk to the cluster hung.  This cluster has generally been stable, and there wasn't anything unusual going on that I was aware of when the problem occurred.  Nothing was written to the ElasticSearch logs on any of the instances.

ElasticSearch Head's main screen said the cluster health was green, but no nodes were displayed on the screen.  The Cluster State tab did display all the nodes.

As far as traffic goes, I'm not really sure exactly what was going on.  There were a number of indexing requests running just prior to the lockup, but I'm not sure what else was happening.

I've attached a JStack dump from the instance.  (I had to run "jstack -F" to get the dump.)
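For reference, that was just the stock JDK tool pointed at the Elasticsearch java process, along the lines of (PID and output path are placeholders):

        jstack -F <es-java-pid> > stack.txt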


stack.txt (93K) Download Attachment

Re: ElasticSearch server lock up

Adrien Grand-2
This 100% CPU usage, and the fact that the application was completely unresponsive, make me think of a garbage collection issue. It could be that you had memory pressure and the GC entered a never-ending full collection. Do you monitor garbage collection? Having garbage collection statistics from right before the application became unresponsive would help confirm whether this is the issue.


--
Adrien Grand


Re: ElasticSearch server lock up

John Bartak
We don't currently keep garbage collection statistics, but I'll add them just in case this ever comes up again.

For what it's worth, the memory settings for the Java process were:  -Xms9006m -Xmx9006m -Xss256k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
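If anyone else wants to capture GC statistics the easy way, turning on the JVM's GC log next to those settings should do it (standard HotSpot flags on JDK 7; the log path is just an example):

        -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:/var/log/elasticsearch/gc.log

That should make it possible to see whether a full collection was running at the time of a lockup.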



Re: ElasticSearch server lock up

Jay Modi
In reply to this post by John Bartak
Coincidentally, we hit a similar issue on Windows yesterday with 0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search threads were in ReferenceManager.acquire(). Looking at the Lucene code for acquire, there is a "busy wait" loop while it tries to increment a reference. The system has stayed in this state, and subsequent jstack dumps show the same behavior, with all search threads in acquire. Prior to the lockup an indexing operation had completed, and the lockup occurred while a few searches and count queries were running.

I have also seen similar issues reported in previous versions of elasticsearch:
https://groups.google.com/forum/#!msg/elasticsearch/A_4ZgfE5xF0/NH8qMASIhvUJ
https://groups.google.com/forum/#!searchin/elasticsearch/ElasticSeach$20full$20CPU$20usage/elasticsearch/XCkWAPBQTvs/VWtpdbHEsEEJ


Re: ElasticSearch server lock up

John Bartak
We're running on CentOS 6.3 and JDK 7u25.  My jstack dump shows lots of threads in ReferenceManager.acquire() as well.  Next time this happens, I'll definitely get the GC statistics and get the hot threads.
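For reference (in case it helps anyone else hitting this), the hot threads output can be pulled from the REST API while the node is spinning; host and port below are just the defaults:

        curl -XGET 'http://localhost:9200/_nodes/hot_threads'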


Re: ElasticSearch server lock up

Jay Modi
I'm not sure that GC is to blame here, but the statistics couldn't hurt, and the hot threads output will be the most telling, IMO.

When we have seen GC problems, the JVM typically pegs a single core at 100%, not 100% usage across all four cores, which is what we are seeing on our system and what I believe you are seeing as well.


Re: ElasticSearch server lock up

Adrien Grand-2
In reply to this post by Jay Modi

This looks bad. Are you using any particular plugin that may acquire/release searchers?

--
Adrien Grand


Re: ElasticSearch server lock up

Jay Modi
No we are not running any plugins. Our application uses the Java API to query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with JDK 6. Restarting "fixes" it, but we only run a single node, so when this happens everything in our application stops working.


Re: ElasticSearch server lock up

Jay Modi
Additionally, I took a heap dump from the ES 0.90.7 instance and I believe I see why the code was stuck in acquire. 

I looked at all of the SearcherManager instances and their index readers, since that is what the code is trying to increment a reference count on when it acquires one. There were 24 readers in total associated with SearcherManager instances. Every reader except one had a refCount of 1; one had a refCount of 0. If there is a race where the refCount drops to 0 (the reader closes) while acquire is being called, the loop will run forever, because tryIncRef on an IndexReader always returns false when the refCount is not greater than 0.
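Roughly, the two pieces of Lucene 4.x code involved look like this (paraphrased from memory, not the exact source):

        // ReferenceManager.acquire(), simplified: spin until the current
        // reference can be ref-counted
        public final G acquire() {
            G ref;
            do {
                if ((ref = current) == null) {
                    throw new AlreadyClosedException("this ReferenceManager is closed");
                }
            } while (!tryIncRef(ref)); // never succeeds once ref's refCount has hit 0
            return ref;
        }

        // IndexReader.tryIncRef(), simplified: only increments while the count is still positive
        public final boolean tryIncRef() {
            int count;
            while ((count = refCount.get()) > 0) {
                if (refCount.compareAndSet(count, count + 1)) {
                    return true;
                }
            }
            return false; // refCount already reached 0, i.e. the reader is closed
        }

So if the reader that current still points to gets closed (refCount reaches 0) without the manager swapping in a new one, acquire() spins forever, which matches the jstack output.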

Hope that makes sense. FWIW my understanding of the code could be wrong and something else could be going on; if clarification or more information from the heap dump/jstack is needed, please let me know.


Re: ElasticSearch server lock up

simonw-3
Hey folks,

we are trying to get to the bottom of this, and I'd like to have some more info. Can you provide us with more insight into what you are doing with ES? What I'd be interested in is:

 * are you using delete by query?
 * are you running parent/child queries?
 * are you seeing any exceptions in the logs?
 * are you embedding elasticsearch?
 * can you reproduce the problem, and if so, how?
 * can you tell me the latest version you have seen this happening on? I could see some problems in 0.90.7 that are fixed in 0.90.8, though.
 * are you forcefully flushing, i.e. calling flush via the API, or updating engine settings of any kind?

simon


Re: ElasticSearch server lock up

John Bartak
In our case we were running on CentOS 6.4:

* We aren't deleting by query.
* We are running parent child queries (although not a lot of them and maybe not at the time this happened).  We do regularly run nested queries.
* We didn't see any exceptions in the logs.
* Not sure what you mean by embedding ElasticSearch, but we were running a three-machine ElasticSearch cluster in Amazon and using no-data nodes to connect to it via the Java API.
* We have not reproduced the problem
* It happened while running 0.90.7, but we haven't upgraded past that yet.
* We are not calling flush or updating engine settings. 


Re: ElasticSearch server lock up

simonw-3
Thanks so much for the info here! That is really helpful! Just to keep you updated, I committed a fix for the infinite loop to Lucene a few minutes ago (https://issues.apache.org/jira/browse/LUCENE-5436), and we are working on porting the fixes to ES as we speak. We are also adding additional checks to ES core to prevent this as much as possible. We will keep you updated. Nevertheless, I'd suggest upgrading to the latest version, 0.90.11, since lots of sneaky things have been fixed!
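Roughly, the change makes acquire() fail fast instead of spinning when the managed reference has already been closed out from under the manager (paraphrased, see the Lucene issue for the actual patch):

        // ReferenceManager.acquire() after the fix, paraphrased
        do {
            if ((ref = current) == null) {
                throw new AlreadyClosedException("this ReferenceManager is closed");
            }
            if (tryIncRef(ref)) {
                return ref;
            }
            if (getRefCount(ref) == 0 && current == ref) {
                // the reference is closed but is still the current one: no progress
                // is possible, so throw instead of busy-waiting
                throw new IllegalStateException("the managed reference was closed outside of the ReferenceManager");
            }
        } while (true);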

simon


Re: ElasticSearch server lock up

Jay Modi
In reply to this post by simonw-3
Hi Simon,

* Yes, we use delete by query
* We do not use the hasChild and hasParent queries, but we use filtered queries on parent ids like this:

        QueryBuilder query = QueryBuilders.matchQuery( "someField", value );

        FilterBuilder filter = FilterBuilders.boolFilter().must( FilterBuilders.termFilter( "_parent", id ) );

        QueryBuilder filteredQuery = QueryBuilders.filteredQuery( query, filter );

* No exceptions in the logs
* Yes, we embed elasticsearch in our application (running as a single node)
* No, we cannot reproduce on demand; it has occurred during normal runtime of our application
* We have seen the issue in 0.20.2 (JDK 6) and 0.90.8 (JDK 7u45)
* No, we are not forcefully flushing


Re: ElasticSearch server lock up

simonw-3
Thanks for the info, though! Here is the pull request to throw an exception instead of going into an infinite loop:

https://github.com/elasticsearch/elasticsearch/pull/5043

I assume you are not indexing into this embedded instance?

simon


Re: ElasticSearch server lock up

simonw-2
Hey folks,

am I right that you both use scan/scroll? Can you tell me a bit about how you use it, and whether there could be long intervals between fetching results?

simon

RE: ElasticSearch server lock up

John Bartak

We definitely use scan/scroll a lot.  It primarily gets used as part of our indexing process.  I would anticipate that we generally only get back a single page of results, and I would expect the calls to scrollResponse.getHits to happen in fairly quick succession.

We typically use something that looks like the following:

        TimeValue scrollKeepAlive = new TimeValue(60000);

        SearchRequestBuilder searchRequest = client.client().prepareSearch(indices)  // always either 1 or 2 indices
                .setSearchType(SearchType.SCAN)
                .setTypes(type)
                .setScroll(scrollKeepAlive)
                .setFilter(filter)
                .setSize(500);
        addFieldsToSearchRequest(searchRequest, fieldList);

        org.elasticsearch.action.search.SearchResponse scrollResp = searchRequest.execute().actionGet();

        if (scrollResp.status() != RestStatus.OK) {
            String msg = "Error running query. Status=" + scrollResp.status();
            throw new SearcherException(msg);
        }

        while (true) {
            scrollResp = client.client().prepareSearchScroll(scrollResp.getScrollId())
                    .setScroll(scrollKeepAlive).execute().actionGet();

            if (scrollResp.status() != RestStatus.OK) {
                String msg = "Error running query. Status=" + scrollResp.status();
                throw new SearcherException(msg);
            }

            if (scrollResp.getHits().hits().length == 0) {
                break;
            }

            for (SearchHit hit : scrollResp.getHits()) {
                // DO SOMETHING WITH THE HIT (that doesn't take very long)
            }
        }


Re: ElasticSearch server lock up

Jay Modi
In reply to this post by simonw-3
Yes, we do index into the embedded instance; our workload typically consists of incoming data and an initial indexing pass, and after that most operations are updates and retrievals. Is there a specific reason you ask?

As for scan/scroll, I don't see anywhere that we use it.


Re: ElasticSearch server lock up

Jay Modi
In reply to this post by simonw-3
I took the latest patch from the Lucene issue, applied it to the 4.6.0 tag, and started using that jar in an instance (running 0.90.8). It looks like that instance hit the condition that previously would have caused the infinite loop. Our logs filled up with exception messages about failing to acquire a searcher. I have created a gist with a portion of the log file containing the failed-to-acquire messages: https://gist.github.com/jaymode/a73a9695e31d7ec53f3f

This appears to have gone on for about 15 to 30 minutes, and then it looks as though a flush occurred and the errors went away; see the following gist: https://gist.github.com/jaymode/421919a3471f8358d5d4
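That flood of exceptions is consistent with an acquire that fails fast instead of spinning. The sketch below is an approximation, not the actual LUCENE-5436 patch: the idea is that once the current reference can no longer be incremented, acquire gives up with an exception, which Elasticsearch then surfaces as the "failed to acquire searcher" messages above.

    import java.util.concurrent.atomic.AtomicInteger;

    public class PatchedAcquireSketch {
        static final AtomicInteger refCount = new AtomicInteger(0); // reference already closed

        static boolean tryIncRef() {
            int count;
            while ((count = refCount.get()) > 0) {
                if (refCount.compareAndSet(count, count + 1)) {
                    return true;
                }
            }
            return false;
        }

        static void acquire() {
            while (!tryIncRef()) {
                // Instead of spinning forever, detect the illegal state and bail out.
                if (refCount.get() == 0) {
                    throw new IllegalStateException(
                            "the managed reference was closed while still current - cannot acquire");
                }
            }
        }

        public static void main(String[] args) {
            try {
                acquire();
            } catch (IllegalStateException e) {
                System.out.println("acquire failed fast instead of hanging: " + e.getMessage());
            }
        }
    }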

Doing some log analysis, I counted the following log messages:

Count     Message
272,190   [Collector 3] [elements][2] failed to acquire searcher, source search
2         [Collector 3] [elements][2] failed to acquire searcher, source search_factory
1         [Collector 3] [elements][2] failed to prepare/warm
1         [Collector 3] [elements][2] Failed to close current SearcherManager

It is interesting that it failed to prepare/warm because we do not explicitly call any index warmer. 

Is there something to be done here so the logs do not fill up with these messages? We keep about 550 MB of logs, and we churned through that much in about 15 minutes when this happened, so I do not have log messages from the beginning of this event.

On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:
Thanks so much for the info here! That is really helpful! Just to keep you updated, I committed a fix for the infinite loop to Lucene 5 minutes ago (https://issues.apache.org/jira/browse/LUCENE-5436) and we are working on porting the fixes to ES as we speak. We are adding additional checks to ES core to prevent this as much as possible. We will keep you updated. Nevertheless, I guess upgrading to the latest version, 0.90.11, would be good for you; there are lots of sneaky things that are fixed!

simon

On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:
In our case we were running on CentOS 6.4.

* We aren't deleting by query.
* We are running parent/child queries (although not a lot of them, and maybe not at the time this happened).  We do regularly run nested queries.
* We didn't see any exceptions in the logs.
* Not sure what you mean by embedding ElasticSearch, but we were running a 3-machine ElasticSearch cluster in Amazon and using no-data nodes to connect to it via the Java API (a minimal connection sketch follows this list).
* We have not reproduced the problem.
* It happened while running 0.90.7, but we haven't upgraded past that yet.
* We are not calling flush or updating engine settings.
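A minimal sketch, assuming standard 0.90.x Java API usage, of the no-data ("client") node connection pattern described above; the cluster name is an illustrative assumption:

    import org.elasticsearch.client.Client;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class NoDataNodeSketch {
        public static void main(String[] args) {
            // client(true) makes this a no-data node that joins the cluster and
            // routes requests to the real data nodes.
            Node node = NodeBuilder.nodeBuilder()
                    .clusterName("my-cluster")   // assumed cluster name
                    .client(true)
                    .node();
            Client client = node.client();
            // ... run searches / bulk indexing through 'client' here ...
            node.close();
        }
    }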


Re: ElasticSearch server lock up

simonw-2
Jay, first of all, good that this prevents the server from going into an infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn clean -DskipTests package) and deploy the artifact that is in target/release/? We added some stuff to prevent this; can you give it a go and see if that fixes your problem?

simon


Re: ElasticSearch server lock up

Jay Modi
Simon, prior to this post I decided to try a custom build from the 0.90.11 tag with the following commits cherry-picked onto my branch:
ad1097f1ba109d6cb235ba541251ba63abb27c16
b4ec18814b3eeb35d948c01abec3e04745f57458
93e9d2146e77f6c0523875b93c768ab7f81cfe04
6056d2cb5553e6277b023e6860739847fbd95bb7

I tried to minimize changes from the latest released version, limiting them to the subset that deals with managing references. I'll see how this works.

Also, we were able to get some log information (client side) to find out more details. The problems started occurring around 4:16am EST and continued until 4:45am EST. We noticed the issues started with a bulk index request (not sure if that helps at all).
