Searches slow down significantly for several seconds every minute with transport client

Searches slow down significantly for several seconds every minute with transport client

Daryl Robbins

I am seeing a consistent bottleneck in requests (taking about 2+ seconds) at the same second every minute across all four of my client nodes, which connect using the transport client from Java. These nodes are completely independent aside from their reliance on the ElasticSearch cluster, and yet they all pause at the exact same second every minute. The exact second when this happens varies over time, but the four nodes always pause at the same time.

I have 4 web nodes that connect to my ES cluster via transport. They connect to a load balancer fronting our 3 dedicated master nodes. The cluster contains 2 or more data nodes depending on the configuration. Regardless of the number, I am seeing the same symptoms.

Any hints on how to proceed to troubleshoot this issue on the ElasticSearch side would be greatly appreciated. Thanks very much!


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/af209904-9113-43d0-8cbc-0c85afe52611%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Searches slow down significantly for several seconds every minute with transport client

Mark Walkom-2
Have you checked the logs for GC events or similar? What about the web logs for events coming in?


Re: Searches slow down significantly for several seconds every minute with transport client

Daryl Robbins
Thanks for your response. GC was my first thought too. I have looked through the logs and run the app through a profiler, and I am not seeing any spike in GC activity or any other background thread when performance degrades. Also, the fact that the slowdown occurs at exactly the same second every minute points me towards a more deliberate timeout or heartbeat.

I am running these tests in a controlled performance environment with constant light to moderate load. There is no change in the behaviour when under very light load. I have turned on slow logging for queries/fetches but am not seeing any slow queries corresponding with the problem. The only time I see a slow query is post-cold start of the search node, so it is at least working.
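For reference, that query/fetch slow logging can be enabled dynamically per index with a settings update along these lines (the index name "my_index", the host, and the threshold values are all placeholders, not taken from the thread):

```shell
# Per-index search slowlog thresholds. "my_index", the host, and the
# threshold values below are placeholders; tune to your environment.
SLOWLOG_SETTINGS='{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "500ms"
}'
# Apply to a live index:
# curl -XPUT "http://localhost:9200/my_index/_settings" -d "$SLOWLOG_SETTINGS"
```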


Re: Searches slow down significantly for several seconds every minute with transport client

Glen Smith
Have you run 'top' on the nodes?


Re: Searches slow down significantly for several seconds every minute with transport client

Daryl Robbins
Thanks, Glen. Yes, I have run top: the Java Tomcat process is the only thing running at the time. I also checked the thread activity in JProfiler and nothing out of the ordinary popped up.


Re: Searches slow down significantly for several seconds every minute with transport client

Glen Smith
Cool.

If I read this right, your response-time statistics graph includes:
1 - network latency between the client nodes and the load balancer
2 - network latency between the load balancer and the master-eligible nodes
3 - performance of the load balancer itself
My interest in checking out 1 and 2 would depend on the network topology. I would for sure want to rule out 3: any possibility of letting at least one of the client nodes bypass the LB for a minute or two?

Then I might be tempted to set up a script that hits _cat/thread_pool for 60 seconds at a time, across the various thread pools/fields, looking for spikes. Maybe the same thing with _nodes/stats.
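A minimal sketch of that kind of polling loop (the host is a placeholder, and the `h=` column list is just one plausible selection; point it at any node and run it across a test window):

```shell
#!/usr/bin/env bash
# Poll the _cat/thread_pool API once a second, timestamping each sample so
# spikes can be lined up against the once-a-minute slowdowns.
# ES_HOST is a placeholder; point it at any node in the cluster.
ES_HOST="${ES_HOST:-localhost:9200}"

poll_thread_pools() {
  local seconds="${1:-60}"
  for _ in $(seq 1 "$seconds"); do
    date +%T
    curl -s "http://$ES_HOST/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected" || true
    sleep 1
  done
}

# During a test window, e.g.:
# poll_thread_pools 60 > thread_pool.log
```

The same loop works against `_nodes/stats` by swapping the URL.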




Re: Searches slow down significantly for several seconds every minute with transport client

Daryl Robbins
Thank you, Glen. I appreciate your insight!

Here is our environment:

All nodes are running in a VPC within the same region of AWS, so inter-node latency should be very minimal.

I was thinking the same thing about the ES LB. I was wondering if we were hitting a keepalive timeout or if the extra level of indirection was otherwise causing a problem. So, earlier today I tried removing the ES LB between the API Server nodes (ES clients) and the eligible masters; each API node is now configured with the private IPs of the three eligible masters. There was no change in the observed behaviour.

The Load Balancer in front of the API Servers is pre-warmed to 10,000 requests per second. And we're only throwing a couple hundred at it for the moment.

Thanks for the suggestion about polling various stats on the server. I'll see what I can rig up.


Re: Searches slow down significantly for several seconds every minute with transport client

Daryl Robbins
Well, it appears that this issue was actually unrelated to ElasticSearch after all. The problem was between the API Load Balancer and the API Server nodes. We are using Elastic Beanstalk, a managed application container, to host these API nodes, and it turns out the Apache configuration in the Amazon gold image was wrong: the keep-alive and timeout settings were not set properly, resulting in timeouts on the load balancer every minute, which caused the massive spike in response time.
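For anyone hitting the same symptom: the usual shape of the fix is to make the backend's keep-alive window longer than the ELB's idle timeout (60 seconds by default), so that the load balancer rather than Apache closes idle connections. A sketch of the relevant httpd directives (the values here are illustrative, not the actual ones from this deployment):

```apache
# Keep idle persistent connections open longer than the ELB idle timeout
# (60s by default), so the load balancer, not Apache, closes them.
KeepAlive On
KeepAliveTimeout 120
MaxKeepAliveRequests 0   # 0 = unlimited requests per persistent connection
Timeout 120
```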

In the performance environment, there is long enough between tests for all the connections from the LB to time out. So, when the load starts up, all the connections to all the nodes are established at once, which is why they were all on the same schedule, and it also explains why that schedule would change over time (it depended on what time the test started).

Since correcting the Apache configuration, the occurrence rate of requests taking longer than 500 ms is down from 3% to 0.004%. And generally, those requests now take 1 second instead of 2-4 seconds.

So, I would say that it is now resolved, aside from a little more optimization.

Thank you, everyone, for your responses.

On Wednesday, April 15, 2015 at 4:05:27 PM UTC-4, Daryl Robbins wrote:
Thank you, Glen. I appreciate your insight!

Here is our environment:

<a href="https://lh3.googleusercontent.com/-PLejC0Yt98I/VS7BDRa23pI/AAAAAAAAAhk/MVoWqrRI8ls/s1600/ES%2BSetup.png" style="margin-left:1em;margin-right:1em" target="_blank" rel="nofollow" onmousedown="this.href='https://lh3.googleusercontent.com/-PLejC0Yt98I/VS7BDRa23pI/AAAAAAAAAhk/MVoWqrRI8ls/s1600/ES%2BSetup.png';return true;" onclick="this.href='https://lh3.googleusercontent.com/-PLejC0Yt98I/VS7BDRa23pI/AAAAAAAAAhk/MVoWqrRI8ls/s1600/ES%2BSetup.png';return true;">

All nodes are running in a VPC within the same region of AWS, so inter-node latency should be very minimal.

I was thinking the same thing about the ES LB as well. I was wondering if we were hitting a keepalive timeout or if the level of indirection was otherwise creating a problem in the process. So, I tried removing the ES LB between the API Server nodes (ES clients) and the Eligible Masters earlier today. Each API node is now configured with the private IPs of the three eligible masters. There was no change in the observed behaviour following this change.

The Load Balancer in front of the API Servers is pre-warmed to 10,000 requests per second. And we're only throwing a couple hundred at it for the moment.

Thanks for the suggestion about polling various stats on the server. I'll see what I can rig up.

On Wednesday, April 15, 2015 at 3:38:04 PM UTC-4, Glen Smith wrote:
Cool.

If I read right, your response time statistics graph includes 
1 - network latency between the client nodes and the load balancer
2 - network latency between the load balancer and the cluster eligible masters.
3 - performance of the load balancer
My interest in checking out 1 & 2 would depend on the network topology.
I would for sure want to do something to rule out 3. Any possibility of letting at least one of the client nodes
bypass the LB for a minute or two?

Then, I might be tempted to set up a script to hit _cat/thread_pool for 60 seconds at a time, with various of the thread pools/fields, looking for spikes.
Maybe the same thing with _nodes/stats.



On Wednesday, April 15, 2015 at 1:48:17 PM UTC-4, Daryl Robbins wrote:
Thanks, Glen. Yes, I have run top: the Java Tomcat process is the only thing running at the time. I also checked the thread activity in JProfiler and nothing out of the ordinary popped up.

On Wednesday, April 15, 2015 at 1:36:55 PM UTC-4, Glen Smith wrote:
Have you run 'top' on the nodes?

On Wednesday, April 15, 2015 at 8:56:20 AM UTC-4, Daryl Robbins wrote:
Thanks for your response. GC was my first thought too. I have looked through the logs and run the app through a profiler, and I am not seeing any spike in GC activity or any other background thread when performance degrades. Also, the fact that the slowdown occurs at exactly the same second every minute points me towards a more deliberate timeout or heartbeat.
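One quick way to confirm that alignment from the client side is to bucket request latencies by second-of-minute: if the pauses are minute-aligned, a single bucket dominates. A sketch with made-up sample data (in practice you would extract (epoch_seconds, latency_ms) pairs from your access logs):

```python
from collections import defaultdict

def slow_seconds(samples, threshold_ms=1000):
    """Count slow requests per second-of-minute; samples are (epoch_s, latency_ms)."""
    buckets = defaultdict(int)
    for ts, latency_ms in samples:
        if latency_ms >= threshold_ms:
            buckets[int(ts) % 60] += 1
    return dict(buckets)

# Synthetic data: every request at second 17 of each minute takes ~2.2 s.
samples = [(t, 2200 if t % 60 == 17 else 40) for t in range(600)]
print(slow_seconds(samples))  # -> {17: 10}: all slow requests share one second
```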

I am running these tests in a controlled performance environment with constant light-to-moderate load, and the behaviour does not change under very light load. I have turned on slow logging for queries/fetches but am not seeing any slow queries corresponding with the problem. The only time I see a slow query is right after a cold start of a search node, so the slow logging is at least working.
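For reference, search slow logging of the sort mentioned above is controlled by per-index threshold settings. The values below are examples only, not recommendations, shown in elasticsearch.yml form:

```yaml
# Example thresholds only -- tune to your own latency targets.
index.search.slowlog.threshold.query.warn: 2s
index.search.slowlog.threshold.query.info: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 200ms
```

With a warn threshold around the observed ~2 s pause, any query that is genuinely slow inside Elasticsearch should surface in the slow log; if nothing does, the delay is likely outside the query path.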

On Wednesday, April 15, 2015 at 1:00:00 AM UTC-4, Mark Walkom wrote:
Have you checked the logs for GC events or similar? What about the web logs for events coming in?

On 15 April 2015 at 09:03, Daryl Robbins <[hidden email]> wrote:

I am seeing a consistent bottleneck in requests (taking about 2+ seconds) at the same second every minute across all four of my client nodes who are connecting using the transport client from Java. These nodes are completely independent aside from their reliance on the ElasticSearch cluster and consequently they all happen to pause at the exact same second every minute. The exact second when this happens varies over time, but the four nodes always pause at the same time.

I have 4 web nodes that connect to my ES cluster via transport. They connect to a load balancer fronting our 3 dedicated master nodes. The cluster contains 2 or more data nodes dependent on the configuration. Regardless of the number, I am seeing the same symptoms.

Any hints on how to proceed to troubleshoot this issue on the ElasticSearch side would be greatly appreciated. Thanks very much!

[Image: Screenshot 2015-04-14 18.53.24.png]



--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cb3a1f30-b1b9-41ac-bbdb-da94135a3f6e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.