Cluster Split Brain

Cluster Split Brain

phobos182
Our cluster has split-brained a few times this week. Here are the error logs. Is there anything I can take a look at?

http://www.pastie.org/2527719
http://www.pastie.org/pastes/2527779

Re: Cluster Split Brain

kimchy
Administrator
Can you check the logs to see why the nodes disconnect? You should see log messages when nodes are disconnected.

On Tue, Sep 13, 2011 at 9:43 PM, phobos182 <[hidden email]> wrote:
We have been having issues with our cluster split braining a few times this
week. Here is the error logs. Anything I can take a look at?

http://www.pastie.org/2527719


--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Cluster-Split-Brain-tp3333510p3333510.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.


Re: Cluster Split Brain

phobos182
It happened again, and this time I have the log files.

192.168.200.110
----
http://www.pastie.org/pastes/2537779


192.168.200.109
----
http://www.pastie.org/pastes/2537721

Master
----
http://www.pastie.org/pastes/2537751
http://www.pastie.org/pastes/2537894

It looks like the "Master" went offline and could not ping any other node, or something to that effect, but it came back online shortly afterwards. Why would the cluster split-brain if one node (which happens to be the master) went down? I'm sure the cluster had quorum to route around a failed node.

It also looks like the master went haywire with the cluster updates during that time: it went from version 900 to 925 in less than a second.
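
For reference, the quorum reasoning in the question above can be sketched as follows. A strict majority of N master-eligible nodes is floor(N/2) + 1, and a partition that sees fewer nodes than that should not elect its own master. In Elasticsearch releases that use zen discovery, this threshold is set through `discovery.zen.minimum_master_nodes`; the thread does not say whether it was configured on this cluster, and the node counts and function names below are hypothetical.

```python
def majority(master_eligible_nodes: int) -> int:
    """Smallest node count that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def may_elect_master(visible_nodes: int, master_eligible_nodes: int) -> bool:
    """True if a partition that sees `visible_nodes` (itself included) has quorum."""
    return visible_nodes >= majority(master_eligible_nodes)


# Hypothetical 3-node cluster (assumption, not taken from the thread).
print(majority(3))             # 2
print(may_elect_master(1, 3))  # False: an isolated node must not elect itself
print(may_elect_master(2, 3))  # True: the remaining pair may elect a new master
```

If the threshold is left at 1, an isolated node can still consider itself master while the other nodes elect a new one, which is one way two masters can end up coexisting.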


Re: Cluster Split Brain

kimchy
Administrator
I don't see the log messages of the node being disconnected from the cluster.

On Thu, Sep 15, 2011 at 6:04 PM, phobos182 <[hidden email]> wrote:
It happened again, and this time I have the log files.

http://www.pastie.org/pastes/2537721



Re: Cluster Split Brain

phobos182
I wanted to follow up on this.

It turns out that the CentOS 6 kernel version we were using had an issue with the bnx2 driver for the Broadcom NICs in our servers, which caused very short-lived outages on the interfaces.

We updated the kernel drivers, rebooted each node, and have not seen this error since. The tip-off was the error message in the logs that a server could not be pinged for 20-30s. Considering we have a very robust 480Gbit/s network core with 10Gb LAGs, this pointed directly to the servers.
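
One way to confirm short interface outages like these independently of Elasticsearch is to record the time of every successful ping to a node and look for gaps. The sketch below is a minimal illustration of that idea; the function name, the 20-second threshold, and the sample timestamps are assumptions, not something taken from the logs in this thread.

```python
from datetime import datetime, timedelta


def find_ping_gaps(successes, threshold=timedelta(seconds=20)):
    """Return (start, end, duration) for every gap between consecutive
    successful pings that is at least `threshold` long."""
    ordered = sorted(successes)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        duration = current - previous
        if duration >= threshold:
            gaps.append((previous, current, duration))
    return gaps


# Hypothetical data: one ping per second, with a 25-second hole in the middle.
start = datetime(2011, 9, 15, 18, 0, 0)
samples = [start + timedelta(seconds=s) for s in (0, 1, 2, 27, 28)]
for gap_start, gap_end, duration in find_ping_gaps(samples):
    print(gap_start.time(), "->", gap_end.time(), duration.total_seconds())
```

Gaps that show up on a single host while the rest of the network stays reachable point at that host's NIC or driver rather than at the network core.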