Max latency between nodes


Max latency between nodes

skik2skis
I currently have an Elasticsearch cluster with 7 nodes. Some of the connectivity between nodes is across fiber with 2-3 ms latency. About once a day we see a node drop from the cluster, a new master is elected, and the dropped node usually returns to the cluster 30-45 seconds later. The configuration on all nodes has been tweaked as follows to tolerate the slight increase in latency, but we still see a timeout when a node drops. Is it expected that even 2 ms of latency would cause issues with the cluster? If so, is further configuration needed to make the cluster more tolerant of the latency? Or should this latency be fine, meaning I should investigate other root causes for the nodes dropping occasionally? I've confirmed that we're never actually dropping packets between nodes, so something is causing them to not respond to pings for 5 x 60s.

zen-disco-node_failed([CDPX-PRD-ELS4][lkquUBfHT1aXAO3-_tCNCg][cdpx-prd-els4][inet[10.9.64.142/10.9.64.142:9300]]{master=false}), reason failed to ping, tried [5] times, each with maximum [1m] timeout

discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
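
For reference, here is how those three settings combine, assuming the zen fault-detection behaviour the log line above describes (a node is only declared failed after ping_retries consecutive pings each time out):

discovery.zen.fd.ping_interval: 15s   # how often the master pings each node (and vice versa)
discovery.zen.fd.ping_timeout: 60s    # how long to wait for each ping reply
discovery.zen.fd.ping_retries: 5      # consecutive timed-out pings before the node is dropped
# worst case before "failed to ping, tried [5] times": roughly 5 x 60s = ~5 minutes with no reply,
# so 2-3 ms of link latency cannot trip this on its own; the node must be failing to answer at all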



Re: Max latency between nodes

Binh Ly-2
Other than the network, is it possible that your nodes are sometimes overloaded enough that they cannot respond immediately? If that's the case, you can probably take 3 nodes (servers) and make them master-only (node.master: true, node.data: false), and set discovery.zen.minimum_master_nodes: 2 for those 3 nodes. Then make the rest of your data nodes non-master-eligible (node.master: false, node.data: true). This way you have 3 nodes dedicated solely to cluster state/master tasks, unimpeded by load or anything else other than your network. Just don't run anything else on them or send queries/indexing jobs to these 3 nodes. :)
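
A rough elasticsearch.yml sketch of that layout (the master-only/data-only split is Binh's suggestion; the exact file layout here is only illustrative):

# on the 3 dedicated master nodes
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2   # quorum of the 3 master-eligible nodes

# on every data node
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2   # commonly set to the same value on all nodes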


Re: Max latency between nodes

skik2skis
We are currently running dedicated master nodes, but I believe they are also servicing queries. I can change it so that queries only hit the data nodes and see if that eliminates the issue...
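
One simple way to enforce that, assuming clients talk to the cluster over HTTP (an assumption on my part, not something stated above), is to disable HTTP on the master-only nodes so queries can only land on data nodes:

# elasticsearch.yml on the master-only nodes
http.enabled: false   # the node still joins the cluster over transport (9300) but serves no REST traffic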



Re: Max latency between nodes

Alexander Reelsen-2
Hey,

Is there anything in the logfile of the master node about why it was de-elected (a network outage as well)? Did you also give your master nodes a huge heap, which could cause long pauses during GC?
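
As a sketch of the sizing rule of thumb usually quoted for 1.x, assuming the stock startup scripts that read ES_HEAP_SIZE (the numbers below are only illustrative):

# data node: roughly half of RAM, and under ~30-32 GB so the JVM keeps compressed object pointers
ES_HEAP_SIZE=4g    # e.g. on an 8 GB box
# dedicated master-only node: it holds just cluster state, so a small heap is enough and keeps GC pauses short
ES_HEAP_SIZE=1g    # e.g. on a 4 GB box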


--Alex



Re: Max latency between nodes

skik2skis
So far the only log message we've seen is:

zen-disco-node_failed([CDPX-PRD-ELS4][lkquUBfHT1aXAO3-_tCNCg][cdpx-prd-els4][inet[10.9.64.142/10.9.64.142:9300]]{master=false}), reason failed to ping, tried [5] times, each with maximum [1m] timeout

We have other data traversing the network that would be very sensitive to any latency or outages, as well as alerts that would fire if we had a network outage, so I am confident we don't have any network issues when this occurs. Furthermore, we only ever see data nodes drop; the masters never drop.

Is there a recommended heap size for master-only nodes? And any recommendations on heap size for data nodes? I assume this could be a timeout caused by GC pauses, since our data nodes have larger heaps?


Re: Max latency between nodes

skik2skis
We're still seeing node drops, and what is more bizarre is that we're seeing them on a test cluster we stood up that has no activity on it at all (no reads or writes). Does anyone have any additional thoughts? Here is the configuration, plus the logs we see on the drops.

SC-TLS1 - 4GB Memory 1GB Heap (Master)
SC-TLS2 - 4GB Memory 1GB Heap (Master)
SC-TLS3 - 8GB Memory 1GB Heap (Data)
SC-TLS4 - 8GB Memory 1GB Heap (Data)
SC-TLS5 - 8GB Memory 1GB Heap (Data)

PX-TLS3 - 8GB Memory 1GB Heap (Data)
PX-TLS4 - 8GB Memory 1GB Heap (Data)
PX-TLS5 - 8GB Memory 1GB Heap (Data)

Elasticsearch 1.0.1

Elasticsearch Configuration Settings

bootstrap.mlockall: true
discovery.zen.ping.timeout: 15s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.9.84.206[9300-9400]", "10.9.84.213[9300-9400]"]
action.destructive_requires_name: true
discovery.zen.fd.ping_interval: 30s
discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 10

Events from SC-TLS1 (Master)

[2014-05-26 22:09:22,953][INFO ][cluster.service          ] [SC-TLS1] removed {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[/10.9.64.223:9300]]{dc=PX, master=false},}, reason: zen-disco-receive(from master [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}])

[2014-05-26 22:12:07,085][INFO ][cluster.service          ] [SC-TLS1] added {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[/10.9.64.223:9300]]{dc=PX, master=false},}, reason: zen-disco-receive(from master [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}])

 

Events from PX-TLS5

[2014-05-26 22:09:37,010][INFO ][discovery.zen            ] [PX-TLS5] master_left [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}], reason [do not exists on master, act as master failure]

[2014-05-26 22:09:37,011][INFO ][cluster.service          ] [PX-TLS5] master {new [SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true}, previous [SCTLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}}, removed {[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true},}, reason: zen-disco-master_failed ([SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true})

[2014-05-26 22:10:07,035][INFO ][discovery.zen            ] [PX-TLS5] master_left [[SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true}], reason [no longer master]

[2014-05-26 22:10:07,036][WARN ][discovery.zen            ] [PX-TLS5] not enough master nodes after master left (reason = no longer master), current nodes: {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[PX-TLS5/10.9.64.223:9300]]{dc=PX, master=false},[PX-PRD-TLS3][t9ZGWrc0Qi2ASDF5te75Pw][PX-prd-tls3][inet[/10.9.64.213:9300]]{dc=PX, master=false},[SC-TLS5][NulqNMVoQiu2nu4p6w8Usg][SC-tls5][inet[/10.9.84.210:9300]]{dc=SC, master=false},[SC-TLS4][DGWDAMr9QYmN5nNjFNMyjw][SC-tls4][inet[/10.9.84.209:9300]]{dc=SC, master=false},[SC-TLS3][0QNRAMFRSgizAfWO9yxBdw][SC-tls3][inet[/10.9.84.214:9300]]{dc=SC, master=false},[PX-PRD-TLS4][4gh2_7c2RiWY9MZQCuJtjw][PX-prd-tls4][inet[/10.9.64.214:9300]]{dc=PX, master=false},}

[2014-05-26 22:10:07,037][INFO ][cluster.service          ] [PX-TLS5] removed {[SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true},[PX-PRD-TLS3][t9ZGWrc0Qi2ASDF5te75Pw][PX-prd-tls3][inet[/10.9.64.213:9300]]{dc=PX, master=false},[SC-TLS5][NulqNMVoQiu2nu4p6w8Usg][SC-tls5][inet[/10.9.84.210:9300]]{dc=SC, master=false},[SC-TLS4][DGWDAMr9QYmN5nNjFNMyjw][SC-tls4][inet[/10.9.84.209:9300]]{dc=SC, master=false},[SC-TLS3][0QNRAMFRSgizAfWO9yxBdw][SC-tls3][inet[/10.9.84.214:9300]]{dc=SC, master=false},[PX-PRD-TLS4][4gh2_7c2RiWY9MZQCuJtjw][PX-prd-tls4][inet[/10.9.64.214:9300]]{dc=PX, master=false},}, reason: zen-disco-master_failed ([SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true})
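
One thing worth noting from the listing above: only SC-TLS1 and SC-TLS2 are master-eligible, and the PX-TLS5 warning "not enough master nodes after master left" implies some discovery.zen.minimum_master_nodes value is in effect even though it doesn't appear in the settings listed earlier. A quick sketch of the quorum arithmetic behind Binh's earlier suggestion (the third master host would be hypothetical):

# rule of thumb: minimum_master_nodes = (number of master-eligible nodes / 2) + 1
# current layout:   2 master-eligible -> quorum 2 -> losing sight of either SC master stalls the node
# suggested layout: 3 master-eligible -> quorum 2 -> any single master can drop without a stall or split
discovery.zen.minimum_master_nodes: 2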

 

