Random node disconnects in Azure, no resource issues as near as I can tell

Eric Brandes
I have a 3-node cluster running ES 1.0.1 in Azure.  The nodes are Windows VMs with 7 GB of RAM each, and the JVM heap is set to 4 GB per node.  There is a single index in the cluster with 50 shards and 1 replica.  The primary shards hold 29 million documents in total, and the store size is 60 GB (including replicas).
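
As a rough sanity check on the numbers above, the per-node load can be worked out with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes shards and data are spread evenly across the three nodes, which the post does not state, and the variable names are illustrative.

```python
# Figures taken from the post above.
nodes = 3
primary_shards = 50
replicas_per_primary = 1
store_size_gb = 60          # includes replicas
ram_per_node_gb = 7
heap_per_node_gb = 4

# Every primary has one replica copy, so count primaries plus replicas.
total_shard_copies = primary_shards * (1 + replicas_per_primary)

# Assumption: shard copies and data are distributed evenly across nodes.
shard_copies_per_node = total_shard_copies / nodes
store_per_node_gb = store_size_gb / nodes

# RAM left for the OS and filesystem cache once the heap is reserved.
non_heap_ram_gb = ram_per_node_gb - heap_per_node_gb

print(f"total shard copies: {total_shard_copies}")
print(f"shard copies per node: {shard_copies_per_node:.1f}")
print(f"store per node: {store_per_node_gb:.0f} GB")
print(f"RAM outside the heap: {non_heap_ram_gb} GB per node")
```

Under that even-distribution assumption, each node carries about 33 shard copies and 20 GB of index data, with roughly 3 GB of RAM outside the heap.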

Almost every day now, a random node disconnects from the cluster.  The usual suspect is a ping timeout.  The longest GC in the logs is about 1 second, and the boxes don't look resource constrained at all: CPU never goes above 20%, the used JVM heap across the cluster never goes above 6 GB (out of 12 GB total), and the field data cache never gets over 1 GB.  The node that drops out is different every day.  I have discovery.zen.minimum_master_nodes set, so there's no split-brain scenario, but there are times when the disconnected node NEVER rejoins until I bounce the process.

Has anyone seen this before?  Is it an Azure networking issue?  How can I tell?  If it's a resource problem, what's the best way for me to turn on logging to diagnose it?  What else can I tell you, or what other steps can I take to figure this out?  It's really quite maddening :(
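
One way to pin down when a node drops is to poll the cluster health endpoint from a machine inside the virtual network and log every deviation from the expected node count. The sketch below is not from the thread: it assumes the HTTP API is reachable on the default port 9200 at a hypothetical host name, and the helper names are made up for illustration. `GET /_cluster/health` and its `number_of_nodes` and `status` fields are part of the standard Elasticsearch API.

```python
import json
import time
import urllib.request


def check_health(health, expected_nodes):
    """Return a warning string if the cluster looks degraded, else None."""
    seen = health.get("number_of_nodes")
    status = health.get("status")
    if seen != expected_nodes:
        return f"expected {expected_nodes} nodes, cluster reports {seen} (status={status})"
    if status == "red":
        return "all nodes present but cluster status is red"
    return None


def fetch_health(base_url, timeout=5.0):
    """Fetch and decode GET /_cluster/health from one node."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def watch(base_url, expected_nodes, interval=10.0):
    """Poll forever, printing a timestamped line whenever something looks off."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            warning = check_health(fetch_health(base_url), expected_nodes)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            warning = f"could not reach {base_url}: {exc}"
        if warning:
            print(f"{stamp} {warning}")
        time.sleep(interval)
```

Calling something like `watch("http://some-node:9200", expected_nodes=3)` (the address is a placeholder) against each node in turn would give timestamps that can be lined up with GC entries and ping-timeout messages in the Elasticsearch logs.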

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8f85c254-9d53-4507-a340-4c8f2a4a078d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Random node disconnects in Azure, no resource issues as near as I can tell

dadoonet
Just checking: are you using the Azure cloud plugin or a unicast list of nodes?

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


On May 30, 2014, at 02:12, Eric Brandes <[hidden email]> wrote:


Re: Random node disconnects in Azure, no resource issues as near as I can tell

Eric Brandes
I'm using the unicast list of nodes at the moment, and I have multicast turned off.  I have not changed the default ping timeout or anything else.

On Thursday, May 29, 2014 7:37:38 PM UTC-5, David Pilato wrote:
Just checking: are you using azure cloud plugin or unicast list of nodes?


Re: Random node disconnects in Azure, no resource issues as near as I can tell

Michael Delaney
Are you using internal fully qualified domain names, e.g. es01.myelasticsearcservice.f3.internal.net?
If you use public load balancer endpoints you'll get timeouts.


Re: Random node disconnects in Azure, no resource issues as near as I can tell

Eric Brandes
The three nodes are connected by an Azure virtual network. They are all part of a single cloud service, operating in a load-balanced set.  I am not currently using any kind of FQDN, so the unicast host names are "es-machine-1", "es-machine-2", etc., with no domain suffix whatsoever.  As far as I know, that bypasses the public load balancer (since none of those host names are accessible to machines outside the virtual network).  But I've been wrong before :)  I actually can't find any kind of fully qualified domain name for those machines other than the public-facing cloudapp.net one, so I assume this is OK?  I've also tried using the internal virtual network IP addresses on a similarly specced development cluster, and I see the same timeouts there.
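
To test whether the node-to-node path itself is flaky, a simple probe can repeatedly open TCP connections to each node's transport port and record failures and connect times. The following is a minimal sketch, not something from the thread: it assumes the default transport port 9300, and the function names are hypothetical. It only measures TCP connect behaviour and does not speak the Elasticsearch transport protocol.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Try one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers refused connections, timeouts, DNS failures
        return False, time.monotonic() - start, str(exc)


def probe_all(hosts, port=9300, rounds=1, pause=1.0):
    """Probe every host `rounds` times and return the list of results."""
    results = []
    for i in range(rounds):
        for host in hosts:
            ok, elapsed, err = probe(host, port)
            results.append((host, ok, elapsed, err))
            stamp = time.strftime("%H:%M:%S")
            state = "ok" if ok else f"FAILED ({err})"
            print(f"{stamp} {host}:{port} {state} {elapsed * 1000:.1f} ms")
        if i + 1 < rounds:
            time.sleep(pause)
    return results
```

Running, for example, `probe_all(["es-machine-1", "es-machine-2"], rounds=3600)` on each node for a day, with both host names and internal IP addresses, might show whether connect failures or latency spikes coincide with the disconnects.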

On Friday, May 30, 2014 1:40:47 AM UTC-5, Michael Delaney wrote:
Are u using internal fully qualified domain names, e.g es01.myelasticsearcservice.f3.internal.net
If you use public load balancer end points you'll get timeouts.
