
ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state


ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
Hi, guys:
  Through testing over the past few days, I found bugs in ES 0.20.4 and 0.20.5 that cause shard allocation failures and leave shards stuck in the initializing state.
  The following are my test steps:
  1) I set up 20 nodes with 0.20.4 and brought a fresh cluster up: 3 master nodes, 2 load-balancer nodes, and 15 data nodes.
  2) After the cluster was up, I tried to create some empty indices, for example index-2013-02-25, index-2013-02-26, index-2013-02-27, index-2013-03-01, etc.
  But some shards stayed stuck in the initializing state for a long time.
{
  "cluster_name" : "es-test",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 105,
  "active_shards" : 201,
  "relocating_shards" : 0,
  "initializing_shards" : 9,
  "unassigned_shards" : 0
}
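  (For reference, a minimal sketch of the kind of commands behind the steps above and the health check; the host and port are illustrative and not given in the report:)

curl -XPUT 'http://localhost:9200/index-2013-02-25'
curl 'http://localhost:9200/_cluster/health?pretty'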
  Moreover, when I created index-2013-03-02, the cluster turned red.
{
  "cluster_name" : "es-test",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 119,
  "active_shards" : 228,
  "relocating_shards" : 0,
  "initializing_shards" : 11,
  "unassigned_shards" : 1
}

  I set the log level to trace and checked the logs; no errors are shown, but from the logs I can see that some shards are initializing and unassigned.
  I also tried 0.20.5, and the same problem happened.
  But with 0.19.11 the problem disappeared: all the empty indices were created successfully and instantly, even with some strange index names.
  So I guess ES has some bugs in 0.20.4 and 0.20.5.
  Could Kimchy or any other ES experts take a look at this problem?
  Thank you very much!
 
-Regards-
-Dong Aihua-


Re: ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
By the way, I tested a single node with 0.20.4 and 0.20.5, and this problem doesn't happen.


Re: ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state

Clinton Gormley-2
In reply to this post by Dong Aihua

Please can you open an issue on github, with a full recreation of what
you need to do to recreate this problem, plus all the logs from all of
the nodes.

ta

Clint




Re: ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state

simonw-2
Hey,

Thanks for testing this. I can see some exceptions during recovery when starting a replica, which cause one of the machines to wait for an answer that never comes back; it doesn't seem to get notified that the connection is closed. Can you try to set: "indices.recovery.internal_action_timeout: 30s"?
That way we can see whether this happens just because of the closed connections here.
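(For reference, a minimal way to apply that setting, assuming the standard config layout, would be to add the line below to elasticsearch.yml on every node and restart it:)

# config/elasticsearch.yml on each node (path assumed); restart the node afterwards
indices.recovery.internal_action_timeout: 30s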

simon

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:
Hi, Clint:
  Today I tested the latest ES version, 0.90.0.Beta1, and it has the same problem.
  My test configuration is as follows:
  20 nodes: 10.96.250.211, 212 and 213 are master nodes (211 is the leader); 214 and 215 are load balancers; the other 15 nodes are data nodes.
  My test steps are as follows:
  1) After the cluster was up, I created an empty index: test1
{"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 15,
  "active_shards" : 27,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for a long time.
 2) Then I created another empty index: abcd1234
{"ok":true,"acknowledged":false}
{
  "cluster_name" : "es-test-0-90-0-beta1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 29,
  "active_shards" : 53,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 1
}
  The cluster stayed in this state for a long time.
 3) Then I created one more empty index: 
{"ok":true,"acknowledged":false}
{
  "cluster_name" : "es-test-0-90-0-beta1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 43,
  "active_shards" : 78,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 6
}
  The cluster stayed in this state for a long time.
  You can refer to the detailed logs from the 20 nodes in the attachments.
   Thank you.

-Regards-
-Dong Aihua-



Re: ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
Hi, Simon:
  I tested the ES version 0.90.0.Beta1 again with the setting indices.recovery.internal_action_timeout: 30s.
  The same problem happened again.
  The configuration is the same as before. The following are my test steps:
  1) curl -XPUT 10.96.250.214:10200/test1
  {"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 15,
  "active_shards" : 28,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for a long time.
  2)  curl -XPUT 10.96.250.214:10200/1234abcd
{"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 30,
  "active_shards" : 57,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
  3) curl -XPUT 10.96.250.214:10200/abcd1234
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 45,
  "active_shards" : 87,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for a long time.
  4) curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 59,
  "active_shards" : 112,
  "relocating_shards" : 0,
  "initializing_shards" : 7,
  "unassigned_shards" : 1
}
  The cluster stayed in this state for a long time.
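  (One way to see exactly which shards are stuck, assuming the shard-level detail parameter of the cluster health API is available in this version, would be:)

curl '10.96.250.214:10200/_cluster/health?level=shards&pretty'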
  The detailed logs of the 20 nodes are attached.
  Thank you.

-Regards-
-Dong Aihua-


Re: ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
I tried uploading the logs several times; all attempts failed with a 340 error. I will upload the logs later.


Re: ES bugs in 0.20.4 and 0.20.5 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
Hi, Simon
  Please get the logs es-test2-0-90-0-beta1_LOGS.tar.gz from https://github.com/dongaihua/shares
  Thank you.

-Regards-
-Dong aihua-


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
Hi:
  Is there any clue about this problem's root cause?
  Thank you.

-Regards-
-Dong Aihua-


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shard allocation failures and shards stuck in the initializing state

simonw-2
So, I looked closer at the latest logs and I see a lot of disconnects going on, giving me the impression you have some network issues. Nevertheless, we pushed some changes to detect these situations earlier, but none of us was able to reproduce your issue. The only thing I can ask is that you try again with the latest master to see whether those commits help in any way.
What is your setup, by the way? Any idea why the servers disconnect all the time?
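(A rough sketch of trying the latest master at the time, assuming Maven and the then-current GitHub repository location; the exact tarball path is illustrative:)

git clone https://github.com/elasticsearch/elasticsearch.git
cd elasticsearch
mvn clean package -DskipTests
# the build produces a distribution tarball (e.g. under target/releases/) to unpack and start on each node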

simon


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shard allocation failures and shards stuck in the initializing state

Ivan Brusic
Hi Simon,

Which commits were they? 9a25867bfe154357165c87a7b509029ff832efa4? Curious to see what has changed.

I have not looked at Dong's logs, but we have also experienced nodes being removed from a cluster even though the process is still running. One possible culprit is unresponsiveness due to GC.
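(Two ways to check the GC theory, assuming the log layout and node stats endpoint of that era; both the grep pattern and the URL are illustrative:)

grep -i 'monitor.jvm' logs/*.log
curl '10.96.250.214:10200/_nodes/stats?pretty'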

-- 
Ivan



Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
In reply to this post by simonw-2
Hi, Simon:
  The problem is that if I switch to version 0.19.11, with the same environment and the same test steps, the cluster works fine.
  In fact, we have two test clusters (this one with 20 nodes, and another with 25 nodes). Both show the same behavior: with 0.19.11 both clusters work fine, but with 0.20.4, 0.20.5 or 0.90.0.Beta1 both have the same problem.
  I also tried to reproduce the problem with 2 nodes running several instances, and the problem didn't happen.
  My test steps are very simple: just set up the cluster and try to create 1~3 empty indices.
  In fact, this problem caused our system to crash after we upgraded ES from 0.19.11 to 0.20.4, and all the data was lost.
  Thank you for your response.
@other people: by the way, has anyone else set up a large cluster with 0.20.4, 0.20.5 or 0.90.0.Beta1? Can you share your results?

-Regards-
-Dong Aihua-


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shard allocation failures and shards stuck in the initializing state

Dong Aihua
In reply to this post by simonw-2
By the way, the server disconnects at the end of the logs are because I issued a shutdown command after the test finished.
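(For context, a cluster-wide shutdown at that time could be issued through the nodes shutdown API; the exact command used here is not stated, so this is only an illustrative sketch:)

curl -XPOST '10.96.250.214:10200/_cluster/nodes/_shutdown'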

在 2013年3月7日星期四UTC+8下午7时04分22秒,simonw写道:
So, I looked closer at the latest logs and I see a lot of disconnects going on giving me the impression you have some network issues. Nevertheless we pushed some stuff to detect these situations earlier but non of us was able to reproduce your issues. the only thing I can ask you for is to try again with latest master to see if those commits helped in any way? 
What is your setup by the way, any idea why servers disconnect all the time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:
Hi, :
  Is there any clue for this problem's root cause?
  Thank you.

-Regards-
-Dong Aihua-

在 2013年3月5日星期二UTC+8下午2时42分55秒,Dong Aihua写道:
Hi, Simon
  Thank you.

-Regards-
-Dong aihua-

在 2013年3月5日星期二UTC+9下午2时43分15秒,Dong Aihua写道:
I uploaded several times for the logs. All failed, I got the 340 error. I will upload the logs later.

在 2013年3月5日星期二UTC+9下午2时21分42秒,Dong Aihua写道:
Hi, Simon:
  I tested the ES version 0.90.0.Beta1 again with the setting indices.recovery.internal_action_timeout: 30s.
  The same problem happened the again.
  The configuration is same as before. The following are my test steps:
  1) curl -XPUT 10.96.250.214:10200/test1
  {"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 15,
  "active_shards" : 28,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for long time
  2)  curl -XPUT 10.96.250.214:10200/1234abcd
{"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 30,
  "active_shards" : 57,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 45,
  "active_shards" : 87,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for long time
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 59,
  "active_shards" : 112,
  "relocating_shards" : 0,
  "initializing_shards" : 7,
  "unassigned_shards" : 1
}
  The cluster stayed in this state for long time.
  The detail logs of 20 nodes are attached.
  Thank you.

-Regards-
-Dong Aihua-


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

simonw-2
In reply to this post by Ivan Brusic


On Thursday, March 7, 2013 10:38:30 PM UTC+1, Ivan Brusic wrote:
Hi Simon,

Which commits were they? 9a25867bfe154357165c87a7b509029ff832efa4? Curious to see what has changed.

yeah that is what I referred to: https://github.com/elasticsearch/elasticsearch/commit/9a25867bfe154357165c87a7b509029ff832efa4
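If you want to look at the change locally, you can also inspect that commit from a checkout of the elasticsearch repo, for example:
git clone https://github.com/elasticsearch/elasticsearch.git   # skip if you already have a checkout
cd elasticsearch
git show 9a25867bfe154357165c87a7b509029ff832efa4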

simon 

I have not looked at Dong's logs, but we have also experienced nodes being removed from a cluster even though the process is still running. One possible culprit is unresponsiveness due to GC.

-- 
Ivan



Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

simonw-2
In reply to this post by Dong Aihua
Thanks for the heads-up, that is what I figured. I am just wondering what triggered the intermittent disconnects that cause the servers to wait on the peers they recover from. Can you tell us a little about your setup: are they VMs, and do you have firewalls in place? Are the servers in different datacenters, and what is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:
By the way, for the servers disconnect at the end of logs, the reason is I did a shutdown command after the test finish


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

Dong Aihua
Hi, Simon:
  Today I tested 0.19.12, and it is fine.
  Then I tested 0.20.0, and the problem happened again. So I guess the problem was introduced in 0.20.0.
  The test steps are the same as before. The following are some logs:
{
  "cluster_name" : "test-0.20.0",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 224,
  "active_shards" : 445,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
  master_node: 
   { name: 'xseed021.kdev',
     transport_address: 'inet[/10.96.250.211:9400]',
     attributes: { data: 'false', master: 'true' } },
  initShards: 
   [ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2012-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: false,
       shard: 11,
       index: 'test-2013-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: true,
       shard: 7,
       index: 'testx-xxx-2013-zzz' },
     { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2013-yyy' } ],
  unassigned_shards_total: 1,
  unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
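  (The raw data behind a summary like the one above can be pulled straight from the cluster state API — a rough sketch, again using my load balancer address:)
curl -s 'http://10.96.250.214:10200/_cluster/state?pretty=true' > state.json
# the routing_nodes section lists every shard together with its state
# (STARTED / INITIALIZING / UNASSIGNED) and the node it is allocated to
grep -c 'INITIALIZING' state.json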
  The detailed logs are located at https://github.com/dongaihua/shares/test-0.20.0_LOGS.tar.gz
  By the way, the servers were set up by the IT team, so I'm not clear on the physical connections. If you really need that information, I can ask and get back to you.
  Thank you.

-Regards-
-dongaihua-


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

Dong Aihua
Hi, Simon:
  Today I tested the latest ES version, 0.20.6, and the problem seems to be fixed in this version.
  Can you confirm it? I'm also wondering how you fixed it.
  Thanks a lot.

-Regards-
-dongaihua-


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

simonw-2
Hmm, that is interesting... I suspect the fix for the missed ChannelClosed events is what fixed it, though. I don't have your logs around anymore, but did you see any TooManyOpenFiles exceptions by any chance? Can you confirm that the latest master has fixed this as well by running a master version / latest build?
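A quick way to check is to grep the node logs for the underlying exception message — something like the following, with the log directory adjusted to wherever your nodes write their logs:
grep -ril 'too many open files' /path/to/elasticsearch/logs/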

simon


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

Dong Aihua
OK, I will test it and send you the result and logs.


Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

Dong Aihua
Hi, Simon:
  I tested the latest trunk code, and it also has no problem. The logs are located at [hidden email]:dongaihua/shares.git
  By the way, we did hit the TooManyOpenFiles problem before, but after we raised the limit with ulimit -n <number>, we haven't seen that problem again.
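  (For anyone who hits the same thing, the change was just raising the per-process file descriptor limit before starting ES — roughly as below; the number and user are only examples:)
ulimit -n              # show the current limit for the shell that starts elasticsearch
ulimit -n 65535        # raise it for this shell before launching the node
# to make it permanent, the usual place is /etc/security/limits.conf, e.g.
#   esuser  -  nofile  65535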
  After you check the logs, can you confirm whether the problem is really resolved?
  Thank you.


在 2013年3月28日星期四UTC+8下午4时36分00秒,Dong Aihua写道:
Ok, I will test it and tell you the result and logs.

在 2013年3月28日星期四UTC+8下午4时07分17秒,simonw写道:
Hmm that is interesting... I suspect that the fix for the missed ChannelClosed events fixed it then though....I don't have you logs around anymore but did you see any TooManyOpenFiles exceptions by any chance?  Can you confirm that the latest master has fixed this as well by running a master version / latest build?

simon

On Thursday, March 28, 2013 3:44:10 AM UTC+1, Dong Aihua wrote:
Hi, Simon:
  Today I tested the latest ES version 0.20.6, the problem seems fixed in this version.
  Can you confirm it? And I'm wondering how do you fix it?
  Thanks a lot.

-Regards-
-dongaihua-

在 2013年3月11日星期一UTC+8下午6时10分08秒,Dong Aihua写道:
Hi, Simon:
  Today I tested 0.19.12, it is fine.
  Then I tested 0.20.0, the problem happened again. So I guess the problem is introduced since 0.20.0. 
  The test steps are as before. And the following are some logs:
{
  "cluster_name" : "test-0.20.0",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 224,
  "active_shards" : 445,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
  master_node: 
   { name: 'xseed021.kdev',
     transport_address: 'inet[/10.96.250.211:9400]',
     attributes: { data: 'false', master: 'true' } },
  initShards: 
   [ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2012-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: false,
       shard: 11,
       index: 'test-2013-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: true,
       shard: 7,
       index: 'testx-xxx-2013-zzz' },
     { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2013-yyy' } ],
  unassigned_shards_total: 1,
  unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
  By the way, the servers are setup by IT team, I'm not clear the physical connections. If you really need those informaiton, I can ask and give you response.
  Thank you.

-Regards-
-dongaihua-

在 2013年3月11日星期一UTC+8下午3时56分07秒,simonw写道:
Thanks for the headsup, that is what I figured. I am just wondering what triggered the intermediate disconnects that causes the servers to wait on their peers they recover from. Can you tell a little about your setup, are they VMs do you have firewalls in place. Are the servers in different datascenters, what is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:
By the way, for the servers disconnect at the end of logs, the reason is I did a shutdown command after the test finish

在 2013年3月7日星期四UTC+8下午7时04分22秒,simonw写道:
So, I looked closer at the latest logs and I see a lot of disconnects going on giving me the impression you have some network issues. Nevertheless we pushed some stuff to detect these situations earlier but non of us was able to reproduce your issues. the only thing I can ask you for is to try again with latest master to see if those commits helped in any way? 
What is your setup by the way, any idea why servers disconnect all the time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:
Hi, :
  Is there any clue for this problem's root cause?
  Thank you.

-Regards-
-Dong Aihua-

在 2013年3月5日星期二UTC+8下午2时42分55秒,Dong Aihua写道:
Hi, Simon
  Thank you.

-Regards-
-Dong aihua-

在 2013年3月5日星期二UTC+9下午2时43分15秒,Dong Aihua写道:
I uploaded several times for the logs. All failed, I got the 340 error. I will upload the logs later.

在 2013年3月5日星期二UTC+9下午2时21分42秒,Dong Aihua写道:
Hi, Simon:
  I tested the ES version 0.90.0.Beta1 again with the setting indices.recovery.internal_action_timeout: 30s.
  The same problem happened the again.
  The configuration is same as before. The following are my test steps:
  1) curl -XPUT 10.96.250.214:10200/test1
  {"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 15,
  "active_shards" : 28,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for long time
  2)  curl -XPUT 10.96.250.214:10200/1234abcd
{"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 30,
  "active_shards" : 57,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 45,
  "active_shards" : 87,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for long time
{
  "cluster_name" : "es-test2-0-90-0-beta1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 59,
  "active_shards" : 112,
  "relocating_shards" : 0,
  "initializing_shards" : 7,
  "unassigned_shards" : 1
}
  The cluster stayed in this state for long time.
  The detail logs of 20 nodes are attached.
  Thank you.

-Regards-
-Dong Aihua-

在 2013年3月4日星期一UTC+8下午4时47分52秒,simonw写道:
Hey,

thanks for testing this. I can see some exception during recovery when starting a replica which causes one of the machines to wait for an answer but it doesn't come back and it doesn't seem to get notified that the connection is closed. Can you try to set: "indices.recovery.internal_action_timeout: 30s"  
So we can see if this happens just because of the closed connections here?

simon

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:
Hi, Clint:
  Today I tested the latest the ES version 0.90.0.Beta1, it has the same problem.
  My test configure is as following:
  20 nodes. 10.96.250.211,212,213 are master nodes. 211 is the leader; 214 and 215 are load balancer; other 15 nodes are data nodes.
  My test step is as following:
  1) after the cluster is up, I created an empty index: test1
{"ok":true,"acknowledged":true}
{
  "cluster_name" : "es-test-0-90-0-beta1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 15,
  "active_shards" : 27,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 0
}
  The cluster stayed in this state for long time.
 2) Then I created another empty index: abcd1234
{"ok":true,"acknowledged":false}
{
  "cluster_name" : "es-test-0-90-0-beta1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 29,
  "active_shards" : 53,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 1
  The cluster stayed in this state for long time
 3) Then I created one more empty index: 
{"ok":true,"acknowledged":false}
{
  "cluster_name" : "es-test-0-90-0-beta1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 43,
  "active_shards" : 78,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 6
}
  The cluster stayed in this state for a long time.
  You can refer to the detailed logs from the 20 nodes in the attachments.
   Thank you.

-Regards-
-Dong Aihua-


On Wednesday, February 27, 2013 6:51:49 PM UTC+8, Clinton Gormley wrote:
On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:

> Hi, guys:
>   These days through the test, I found ES bugs in 0.20.4 and 0.20.5
> which cause shards allocation failure and stuck in initializing state.
>   The following are my test steps:
>   1) I setup 20 nodes with 0.20.4, and bring a fresh cluster up. 3
> nodes are master nodes, 2 nodes are load balancer, 15 nodes are data
> nodes
>   2) After the cluster is up, I tried to create some empty indices for
> example index-2013-02-25, index-2013-02-26, index-2013-02-27,
> index-2013-03-01, etc
>   But some shards stuck in initializing status for long time.

Please can you open an issue on github, with a full recreation of what
you need to do to recreate this problem, plus all the logs from all of
the nodes.

ta

Clint



Re: ES bugs in 0.20.4, 0.20.5 and 0.90.0.Beta1 cause shards allocation failure and stuck in initializing state

Dong Aihua
Hi, Simon:
  Today I upgraded the ES cluster from 0.19.11 to 0.20.6 successfully. The upgrade was smooth.
  And this problem hasn't appeared again.
  Thank you.

On Monday, April 1, 2013 11:21:49 AM UTC+8, Dong Aihua wrote:
Hi, Simon:
  I tested the latest trunk code, and it also has no problem. The logs are located at [hidden email]:dongaihua/shares.git
  By the way, we previously ran into a TooManyOpenFiles problem, but after raising ulimit -n <number> we haven't hit it again.
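  For reference, a minimal sketch of raising the descriptor limit before starting a node (the value is only illustrative):
  ulimit -n 65535    # for the current shell/session; a permanent limit can also be set in /etc/security/limits.conf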
  After you check the logs, can you confirm whether the problem is really resolved?
  Thank you.


On Thursday, March 28, 2013 4:36:00 PM UTC+8, Dong Aihua wrote:
OK, I will test it and send you the results and logs.

On Thursday, March 28, 2013 4:07:17 PM UTC+8, simonw wrote:
Hmm, that is interesting... I suspect that the fix for the missed ChannelClosed events fixed it then, though. I don't have your logs around anymore, but did you see any TooManyOpenFiles exceptions by any chance? Can you confirm that the latest master has fixed this as well by running a master version / latest build?
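A rough sketch of building a snapshot from master at that time (the repository URL and Maven usage here are assumptions; adjust to the current build instructions):
  git clone https://github.com/elasticsearch/elasticsearch.git
  cd elasticsearch && mvn clean package -DskipTests
  # the distribution tarball should then be under target/releases/ (assumed layout)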

simon

On Thursday, March 28, 2013 3:44:10 AM UTC+1, Dong Aihua wrote:
Hi, Simon:
  Today I tested the latest ES version, 0.20.6, and the problem seems to be fixed in this version.
  Can you confirm that? I'm also wondering how you fixed it.
  Thanks a lot.

-Regards-
-dongaihua-

On Monday, March 11, 2013 6:10:08 PM UTC+8, Dong Aihua wrote:
Hi, Simon:
  Today I tested 0.19.12, and it is fine.
  Then I tested 0.20.0, and the problem happened again, so I guess the problem was introduced in 0.20.0.
  The test steps are the same as before, and the following are some logs:
{
  "cluster_name" : "test-0.20.0",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 224,
  "active_shards" : 445,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 1
}
{ cluster_name: 'test-0.20.0',
  master_node: 
   { name: 'xseed021.kdev',
     transport_address: 'inet[/10.96.250.211:9400]',
     attributes: { data: 'false', master: 'true' } },
  initShards: 
   [ { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2012-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: false,
       shard: 11,
       index: 'test-2013-03-11' },
     { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}',
       primary: true,
       shard: 7,
       index: 'testx-xxx-2013-zzz' },
     { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}',
       primary: false,
       shard: 5,
       index: 'testx-xxx-2013-yyy' } ],
  unassigned_shards_total: 1,
  unassigned_indices: [ 'testx-xxx-2013-zzz' ] }
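  The breakdown above was pulled from the cluster state; a rough sketch of fetching the same routing information directly (the host and port here just follow the earlier curl examples and may differ for this cluster):
  curl -XGET 'http://10.96.250.214:10200/_cluster/state?pretty=true'
  # the routing_table / routing_nodes sections list each shard's state (STARTED, INITIALIZING, UNASSIGNED, ...)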
  By the way, the servers were set up by the IT team, so I'm not clear about the physical connections. If you really need that information, I can ask and get back to you.
  Thank you.

-Regards-
-dongaihua-

On Monday, March 11, 2013 3:56:07 PM UTC+8, simonw wrote:
Thanks for the heads-up, that is what I figured. I am just wondering what triggered the intermittent disconnects that cause the servers to wait on the peers they recover from. Can you tell me a little about your setup: are they VMs, and do you have firewalls in place? Are the servers in different datacenters, and what is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:
By the way, the server disconnects at the end of the logs happened because I issued a shutdown command after the test finished.

On Thursday, March 7, 2013 7:04:22 PM UTC+8, simonw wrote:
So, I looked closer at the latest logs and I see a lot of disconnects going on, which gives me the impression you have some network issues. Nevertheless, we pushed some changes to detect these situations earlier, but none of us was able to reproduce your issue. The only thing I can ask is that you try again with the latest master to see if those commits helped in any way.
What is your setup, by the way? Any idea why the servers disconnect all the time?

simon
