Split brain?

Split brain?

Darron Froese
I had a cluster of 2 x 2GB Rackspace cloud boxes running ES 0.18.5 for
the last 3 weeks. It's been working great and we've had several
million records inserted and deleted in that time.

Here is the config:

http://d.pr/WbU7

Yesterday I updated the boxes to 0.18.6 (and was trying to get the
boxes to log to syslog as well) - it appears that something didn't
work so well during the upgrade and I was left with 2 boxes both
thinking that they're masters.

Here are the logs from the boxes during the upgrade:

http://d.pr/HHbC
http://d.pr/n3jk

And then from the next day:

http://d.pr/QEkU
http://d.pr/I7fo

I tried to get them to re-connect, but I couldn't get anything to work
correctly - they were both completely separate.

I now have a single box with the correct index:

http://d.pr/4TTq

I have a chef recipe that builds new elasticsearch boxes, so I spun up
a new box and tried to get it to join the cluster, but no dice - it's
like none of the other boxes exist. I've also tried to go back down to
0.18.5 - no dice.

Is there a way I can point a new box at that master directly and say:
"Hey you're a slave, there is the master."

I'm at a bit of a loss here - and don't want to admit defeat - but I'm
stuck.

It's a system that we're building and I CAN lose the data, but I
really want to understand:

1. Why this happened.
2. How I can recover from this.
3. How I can prevent this from happening in the future.

Thanks in advance - if anybody can point me to something I will
greatly appreciate it.

Re: Split brain?

kimchy
Administrator
Heya,

First, regarding the split brain: it can certainly happen, especially with 2 servers, for example if the network gets disconnected between the two. In that case you end up with two separate one-node clusters, and you will need to resolve it yourself by restarting one of them. You should see in the logs that one node got disconnected from the other. With a larger cluster you could set the "minimum_master_nodes" parameter to reduce the chances of it happening.

I am not sure why the restarted node is not finding the other node. I am assuming you are using unicast discovery; are you sure it's configured properly? You can set discovery: TRACE in the logging.yml file to see which nodes it tries to ping and what the status of each ping is.

-shay.banon
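As a rough sketch of both suggestions above (key names follow the zen discovery settings of that era; the value 2 is only an example, sized for a 3-node cluster):

```yaml
# elasticsearch.yml: require a majority of master-eligible nodes
# before a master can be elected (example value for a 3-node cluster)
discovery.zen.minimum_master_nodes: 2
```

```yaml
# logging.yml: turn on discovery ping tracing to see who gets pinged
logger:
  discovery: TRACE
```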


Re: Split brain?

Darron Froese
I was using multicast discovery and it was working great before -
here's the log with extra debugging:

http://d.pr/9pWC

It looks like it didn't find anything at all.

So I set up unicast and added the master to the config:

http://d.pr/NxBE
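For reference, the unicast change would look roughly like this in elasticsearch.yml (the addresses below are placeholders; the linked config is the authoritative version):

```yaml
# Disable multicast and point discovery at known node addresses instead
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.1:9300", "10.0.0.2:9300"]
```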

And it seemed to work:

http://d.pr/jxjW
http://d.pr/Ubzi

I can switch to using unicast - that's no problem - I'm just not sure
why this happened. Maybe Rackspace made some network changes; not sure
why it worked for almost a month and then suddenly stopped.

A couple of questions:

1. Should I put the IPs of all of the nodes in there?
2. If I go up to a 3-node cluster, should I set "minimum_master_nodes" to 2?

Thanks for the tips Shay - really appreciate it.


Re: Split brain?

kimchy
Administrator
Strange that multicast worked at all; as far as I know it's not supported on Rackspace. Yes, you should put all the IPs in the unicast list - that's recommended when possible. And if you go up to 3 nodes, then yes, I think it makes sense to set minimum master nodes to 2.
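The value 2 for a 3-node cluster follows the usual quorum rule: require a strict majority of master-eligible nodes. A small illustrative helper (not an Elasticsearch API, just the arithmetic):

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum for master election: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1

print(minimum_master_nodes(2))  # 2: with two nodes, both must be present
print(minimum_master_nodes(3))  # 2: a 3-node cluster tolerates losing one node
```

Note that with only 2 nodes the quorum is also 2, so a two-node cluster cannot elect a master after losing either node - which is why split brain is hard to avoid at that size.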



Re: Split brain?

Darron Froese
Yeah - it was working great. All configs are in a git repo and pushed
out via chef; it ran fine for a little over 3 weeks in production, and
a couple of weeks before that in testing.

I have a ticket in to Rackspace to see if they have changed something,
but I will just switch to unicast now.

Thanks for your help - I'll update my configs now.


Re: Split brain?

Darron Froese
FYI - I heard back from Rackspace:

"We did recently implement multicast filtering on our new XenServer
Linux deployments. This was originally the intended design as having
multicast between all customers in the same huddles can be
problematic. I apologize that you used this as a feature before it was
blocked, but I feel it may ultimately the best with multicast
filtered.

Your timeline corresponds exactly with when I heard that the changes
were being rolled out."

Oh well - makes sense now.


Re: Split brain?

kimchy
Administrator
Interesting!


Re: Split brain?

Stanislas Polu
Very interesting! Thanks!

-stan

--
Stanislas Polu 
Mo: +33 6 83 71 90 04 | Tw: @spolu | http://teleportd.com | Realtime Photo Search


