|
Hi, guys:
Over the past few days of testing I found what looks like a bug in ES 0.20.4 and 0.20.5: shard allocation fails and shards get stuck in the initializing state. My test steps were:

1) I set up 20 nodes with 0.20.4 and brought up a fresh cluster: 3 master nodes, 2 load-balancer nodes, and 15 data nodes.
2) Once the cluster was up, I tried to create some empty indices, for example index-2013-02-25, index-2013-02-26, index-2013-02-27, index-2013-03-01, etc.

Some shards stayed stuck in the initializing state for a long time:

    {
      "cluster_name" : "es-test",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 20,
      "number_of_data_nodes" : 15,
      "active_primary_shards" : 105,
      "active_shards" : 201,
      "relocating_shards" : 0,
      "initializing_shards" : 9,
      "unassigned_shards" : 0
    }

Moreover, when I created index-2013-03-02, the cluster went red:

    {
      "cluster_name" : "es-test",
      "status" : "red",
      "timed_out" : false,
      "number_of_nodes" : 20,
      "number_of_data_nodes" : 15,
      "active_primary_shards" : 119,
      "active_shards" : 228,
      "relocating_shards" : 0,
      "initializing_shards" : 11,
      "unassigned_shards" : 1
    }

I set the log level to trace and checked the logs. No errors are shown, but the logs do show that some shards are initializing and unassigned.

I also tried 0.20.5, and the same problem happened. With 0.19.11 the problem disappears: all the empty indices are created instantly, even with some strange index names.

So I guess ES has a bug in 0.20.4 and 0.20.5. Could Kimchy or any other ES expert take a look at this problem?

Thank you very much!
-Regards-
-Dong Aihua-

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
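As a reading aid for the two health responses above, here is a minimal Python sketch that adds up the shards that are not yet active. The function name `summarize_health` and the `needs_attention` flag are illustrative assumptions, not part of Elasticsearch; only the JSON field names come from the output quoted in the message.

```python
import json


def summarize_health(health_json):
    """Summarize a cluster health response like the ones quoted above."""
    health = json.loads(health_json)
    pending = health["initializing_shards"] + health["unassigned_shards"]
    return {
        "status": health["status"],
        "pending_shards": pending,
        # Illustrative rule (an assumption): anything other than green,
        # or any shard still pending, deserves a closer look.
        "needs_attention": health["status"] != "green" or pending > 0,
    }


# The "yellow" response from the message, copied verbatim.
yellow = """
{ "cluster_name" : "es-test", "status" : "yellow", "timed_out" : false,
  "number_of_nodes" : 20, "number_of_data_nodes" : 15,
  "active_primary_shards" : 105, "active_shards" : 201,
  "relocating_shards" : 0, "initializing_shards" : 9,
  "unassigned_shards" : 0 }
"""

print(summarize_health(yellow))
```

Applied to the red response, the same function would count 12 pending shards (11 initializing plus 1 unassigned).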
|
By the way, I also tested a single node with 0.20.4 and 0.20.5, and the problem does not happen there.

On Wednesday, February 27, 2013 9:09:00 AM UTC+9, jackiedong wrote:
> Hi, guys: |
|
In reply to this post by Dong Aihua
On Tue, 2013-02-26 at 16:09 -0800, jackiedong wrote:
> Hi, guys:
> These days through the test, I found ES bugs in 0.20.4 and 0.20.5
> which cause shards allocation failure and stuck in initializing state.
> The following are my test steps:
> 1) I setup 20 nodes with 0.20.4, and bring a fresh cluster up. 3
> nodes are master nodes, 2 nodes are load balancer, 15 nodes are data
> nodes
> 2) After the cluster is up, I tried to create some empty indices for
> example index-2013-02-25, index-2013-02-26, index-2013-02-27,
> index-2013-03-01, etc
> But some shards stuck in initializing status for long time.

Could you please open an issue on GitHub with a full recreation (everything needed to reproduce this problem), plus all the logs from all of the nodes?

ta

Clint |
|
Hey,
thanks for testing this. I can see an exception during recovery when starting a replica. It causes one of the machines to wait for an answer that never comes back, and the machine doesn't seem to get notified that the connection is closed. Can you try setting:

    indices.recovery.internal_action_timeout: 30s

so we can see whether this happens just because of the closed connections here?

simon

On Monday, March 4, 2013 7:22:50 AM UTC+1, Dong Aihua wrote:
> Hi, Clint: |
|
Hi, Simon:
I tested ES version 0.90.0.Beta1 again with the setting indices.recovery.internal_
The same problem happened again. The configuration is the same as before. My test steps were:

1) curl -XPUT 10.96.250.214:10200/test1
{"ok":true,"acknowledged":true}

    {
      "cluster_name" : "es-test2-0-90-0-beta1",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 20,
      "number_of_data_nodes" : 15,
      "active_primary_shards" : 15,
      "active_shards" : 28,
      "relocating_shards" : 0,
      "initializing_shards" : 2,
      "unassigned_shards" : 0
    }

The cluster stayed in this state for a long time.

2) curl -XPUT 10.96.250.214:10200/1234abcd
{"ok":true,"acknowledged":true}

    {
      "cluster_name" : "es-test2-0-90-0-beta1",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 20,
      "number_of_data_nodes" : 15,
      "active_primary_shards" : 30,
      "active_shards" : 57,
      "relocating_shards" : 0,
      "initializing_shards" : 3,
      "unassigned_shards" : 0
    }

3) curl -XPUT 10.96.250.214:10200/abcd1234

    {
      "cluster_name" : "es-test2-0-90-0-beta1",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 20,
      "number_of_data_nodes" : 15,
      "active_primary_shards" : 45,
      "active_shards" : 87,
      "relocating_shards" : 0,
      "initializing_shards" : 3,
      "unassigned_shards" : 0
    }

The cluster stayed in this state for a long time.

4) curl -XPUT 10.96.250.214:10200/xxxxxyyyyy111

    {
      "cluster_name" : "es-test2-0-90-0-beta1",
      "status" : "red",
      "timed_out" : false,
      "number_of_nodes" : 20,
      "number_of_data_nodes" : 15,
      "active_primary_shards" : 59,
      "active_shards" : 112,
      "relocating_shards" : 0,
      "initializing_shards" : 7,
      "unassigned_shards" : 1
    }

The cluster stayed in this state for a long time.

The detailed logs of all 20 nodes are attached. Thank you.
-Regards-
-Dong Aihua-

On Monday, March 4, 2013 4:47:52 PM UTC+8, simonw wrote:
> Hey, |
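The "stayed in this state for a long time" check in the steps above can be made repeatable with a small polling loop. This is a minimal Python sketch under stated assumptions: `wait_until_settled` and the `fetch_health` callable are hypothetical names, and `fetch_health` is assumed to return the cluster health response as a dict with the same fields shown in the message. How the response is fetched (curl, an HTTP client, and so on) is deliberately left to the caller.

```python
import time


def wait_until_settled(fetch_health, timeout_s=60.0, interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until no shard is initializing or unassigned.

    Returns the final health dict, or None if the timeout expires first.
    `fetch_health` is a caller-supplied function (hypothetical here).
    """
    deadline = clock() + timeout_s
    while True:
        health = fetch_health()
        if (health["initializing_shards"] == 0
                and health["unassigned_shards"] == 0):
            return health
        if clock() >= deadline:
            return None  # shards are still pending: report "stuck"
        sleep(interval_s)


# Demo with canned snapshots instead of a live cluster.
snapshots = iter([
    {"initializing_shards": 2, "unassigned_shards": 0},
    {"initializing_shards": 0, "unassigned_shards": 0},
])
settled = wait_until_settled(lambda: next(snapshots), timeout_s=5,
                             interval_s=0, sleep=lambda s: None)
print(settled)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time. A `None` result after a generous timeout would correspond to the stuck state described in the message.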
|
I tried to upload the logs several times, but every attempt failed with a 340 error. I will upload the logs later.

On Tuesday, March 5, 2013 2:21:42 PM UTC+9, Dong Aihua wrote:
> Hi, Simon: |
|
Hi, Simon
Please get the logs es-test2-0-90-0-beta1_LOGS.tar.gz from https://github.com/dongaihua/shares

Thank you.
-Regards-
-Dong Aihua-

On Tuesday, March 5, 2013 2:43:15 PM UTC+9, Dong Aihua wrote:
> I uploaded several times for the logs. All failed, I got the 340 error. I will upload the logs later. |
|
Hi, :
Is there any clue about the root cause of this problem? Thank you.
-Regards-
-Dong Aihua-

On Tuesday, March 5, 2013 2:42:55 PM UTC+8, Dong Aihua wrote:
> Hi, Simon |
|
So, I looked closer at the latest logs and I see a lot of disconnects going on, which gives me the impression that you have some network issues. Nevertheless, we pushed some changes to detect these situations earlier, but none of us was able to reproduce your issue. The only thing I can ask you for is to try again with the latest master to see if those commits helped in any way.

What is your setup, by the way? Any idea why the servers disconnect all the time?

simon

On Thursday, March 7, 2013 3:11:57 AM UTC+1, Dong Aihua wrote:
> Hi, : |
|
Hi Simon,

Which commits were they? 9a25867bfe154357165c87a7b509029ff832efa4? Curious to see what has changed.

I have not looked at Dong's logs, but we have also experienced nodes being removed from a cluster even though the process was still running. One possible culprit is unresponsiveness due to GC.

--
Ivan

On Thu, Mar 7, 2013 at 3:04 AM, simonw <[hidden email]> wrote:
> So, I looked closer at the latest logs and I see a lot of disconnects going on giving me the impression you have some network issues. Nevertheless we pushed some stuff to detect these situations earlier but non of us was able to reproduce your issues. the only thing I can ask you for is to try again with latest master to see if those commits helped in any way? |
|
In reply to this post by simonw-2
Hi, Simon:
The problem is that if I switch to version 0.19.11 with the same environment and the same test steps, the cluster works fine. In fact, we have two test clusters (this one with 20 nodes, and another with 25 nodes), and both have the same problem. With 0.19.11 both clusters work fine, but with 0.20.4, 0.20.5, or 0.90.0.Beta1 both clusters have the same problem.

I also tried to reproduce the problem on 2 nodes running several instances, but the problem didn't happen.

My test steps are very simple: just set up the cluster and try to create 1~3 empty indices.

In fact, this problem caused our system to crash after we upgraded ES from 0.19.11 to 0.20.4, and all the data was lost.

Thank you for your response.

@other people: by the way, has anyone else set up a large cluster with 0.20.4, 0.20.5, or 0.90.0.Beta1? Can you share your results?

-Regards-
-Dong Aihua-

On Thursday, March 7, 2013 7:04:22 PM UTC+8, simonw wrote:
> So, I looked closer at the latest logs and I see a lot of disconnects going on giving me the impression you have some network issues. Nevertheless we pushed some stuff to detect these situations earlier but non of us was able to reproduce your issues. the only thing I can ask you for is to try again with latest master to see if those commits helped in any way? |
|
In reply to this post by simonw-2
By the way, the server disconnects at the end of the logs are there because I ran a shutdown command after the test finished.

On Thursday, March 7, 2013 7:04:22 PM UTC+8, simonw wrote:
> So, I looked closer at the latest logs and I see a lot of disconnects going on giving me the impression you have some network issues. Nevertheless we pushed some stuff to detect these situations earlier but non of us was able to reproduce your issues. the only thing I can ask you for is to try again with latest master to see if those commits helped in any way? |
|
In reply to this post by Ivan Brusic
On Thursday, March 7, 2013 10:38:30 PM UTC+1, Ivan Brusic wrote:
Yeah, that is what I referred to: https://github.com/elasticsearch/elasticsearch/commit/9a25867bfe154357165c87a7b509029ff832efa4

simon |
|
In reply to this post by Dong Aihua
Thanks for the heads-up, that is what I figured. I am just wondering what triggered the intermediate disconnects that cause the servers to wait on the peers they recover from. Can you tell me a little about your setup: are they VMs, and do you have firewalls in place? Are the servers in different datacenters, and what is the connection between them?

simon

On Monday, March 11, 2013 3:14:01 AM UTC+1, Dong Aihua wrote:
> By the way, for the servers disconnect at the end of logs, the reason is I did a shutdown command after the test finish |
|
Hi, Simon:
Today I tested 0.19.12, and it is fine. Then I tested 0.20.0, and the problem happened again. So I guess the problem was introduced in 0.20.0. The test steps are the same as before. The following are some logs:

    {
      "cluster_name" : "test-0.20.0",
      "status" : "red",
      "timed_out" : false,
      "number_of_nodes" : 20,
      "number_of_data_nodes" : 15,
      "active_primary_shards" : 224,
      "active_shards" : 445,
      "relocating_shards" : 0,
      "initializing_shards" : 4,
      "unassigned_shards" : 1
    }

    {
      cluster_name: 'test-0.20.0',
      master_node: {
        name: 'xseed021.kdev',
        transport_address: 'inet[/10.96.250.211:9400]',
        attributes: { data: 'false', master: 'true' }
      },
      initShards: [
        { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}', primary: false, shard: 5, index: 'testx-xxx-2012-03-11' },
        { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}', primary: false, shard: 11, index: 'test-2013-03-11' },
        { node: '{"name":"xseed034.kdev","ip":"10.96.250.224:9400"}', primary: true, shard: 7, index: 'testx-xxx-2013-zzz' },
        { node: '{"name":"xseed038.kdev","ip":"10.96.250.228:9400"}', primary: false, shard: 5, index: 'testx-xxx-2013-yyy' }
      ],
      unassigned_shards_total: 1,
      unassigned_indices: [ 'testx-xxx-2013-zzz' ]
    }

The detailed logs are located at https://github.com/dongaihua/shares/test-0.20.0_LOGS.tar.gz

By the way, the servers were set up by our IT team, so I'm not clear about the physical connections. If you really need that information, I can ask and get back to you.

Thank you.
-Regards-
-dongaihua-

On Monday, March 11, 2013 3:56:07 PM UTC+8, simonw wrote:
> Thanks for the headsup, that is what I figured. I am just wondering what triggered the intermediate disconnects that causes the servers to wait on their peers they recover from. Can you tell a little about your setup, are they VMs do you have firewalls in place. Are the servers in different datascenters, what is the connection between them? |
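One way to read the `initShards` report above is to count the stuck shards per node. The following is a minimal Python sketch, assuming the report has already been reduced to a list of dicts holding the node name, primary flag, shard number, and index name. The helper name `stuck_shards_per_node` is hypothetical; the four entries are transcribed from the message.

```python
from collections import Counter

# Entries transcribed from the initShards list in the message above.
init_shards = [
    {"node": "xseed038.kdev", "primary": False, "shard": 5,
     "index": "testx-xxx-2012-03-11"},
    {"node": "xseed034.kdev", "primary": False, "shard": 11,
     "index": "test-2013-03-11"},
    {"node": "xseed034.kdev", "primary": True, "shard": 7,
     "index": "testx-xxx-2013-zzz"},
    {"node": "xseed038.kdev", "primary": False, "shard": 5,
     "index": "testx-xxx-2013-yyy"},
]


def stuck_shards_per_node(shards):
    """Count initializing shards per node name."""
    return Counter(shard["node"] for shard in shards)


per_node = stuck_shards_per_node(init_shards)
for node, count in per_node.most_common():
    print(node, count)

# The stuck primary lines up with the index reported as unassigned.
stuck_primaries = [s["index"] for s in init_shards if s["primary"]]
print(stuck_primaries)
```

In this report all four initializing shards sit on just two of the 15 data nodes, and the single stuck primary belongs to the index listed under `unassigned_indices`. Whether that concentration is significant cannot be concluded from this one sample.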
|
Hi, Simon:
Today I tested the latest ES version, 0.20.6, and the problem seems to be fixed in this version. Can you confirm that? I'm also wondering how you fixed it.

Thanks a lot.
-Regards-
-dongaihua-

On Monday, March 11, 2013 6:10:08 PM UTC+8, Dong Aihua wrote:
> Hi, Simon: |
|
Hmm, that is interesting... I suspect that the fix for the missed ChannelClosed events fixed it, then. I don't have your logs around anymore, but did you see any TooManyOpenFiles exceptions by any chance? Can you confirm that the latest master has fixed this as well by running a master version / latest build?

simon

On Thursday, March 28, 2013 3:44:10 AM UTC+1, Dong Aihua wrote:
> Hi, Simon: |
|
OK, I will test it and send you the results and logs.

On Thursday, March 28, 2013 4:07:17 PM UTC+8, simonw wrote:
> Hmm that is interesting... I suspect that the fix for the missed ChannelClosed events fixed it then though....I don't have you logs around anymore but did you see any TooManyOpenFiles exceptions by any chance? Can you confirm that the latest master has fixed this as well by running a master version / latest build? |
|
Hi, Simon:
I tested the latest trunk code, and it also shows no problem. The logs are located at [hidden email]:dongaihua/shares.git

By the way, we did hit the TooManyOpenFiles problem once, but after we changed the limit with ulimit -n <number>, we have not seen it again.

After you check the logs, can you confirm whether the problem is really resolved? Thank you.

On Thursday, March 28, 2013 4:36:00 PM UTC+8, Dong Aihua wrote:
> Ok, I will test it and tell you the result and logs. |
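For anyone checking the open-file limit mentioned above, here is a minimal Python sketch that reads the limit the current process inherited, which is the value `ulimit -n` adjusts for a shell. Assumptions: it runs on a Unix-like system (the `resource` module is not available on Windows), it is started by the same user and from the same shell that launches ES, and the `THRESHOLD` value is purely illustrative since the thread gives no number. It reports the limit of the Python process itself, not of a running ES process.

```python
import resource  # Unix-only standard library module


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_low(soft_limit, threshold):
    """True if a finite soft limit is below the given threshold."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # an unlimited setting can never be "too low"
    return soft_limit < threshold


# Hypothetical threshold for illustration only; the thread gives no number.
THRESHOLD = 32000

soft, hard = open_file_limits()
print("soft limit:", soft, "hard limit:", hard)
if is_limit_low(soft, THRESHOLD):
    print("soft limit is below the illustrative threshold")
```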
|
Hi, Simon:
Today I successfully upgraded the ES cluster from 0.19.11 to 0.20.6. The upgrade went smoothly, and the problem has not appeared again. Thank you.

On Monday, April 1, 2013 11:21:49 AM UTC+8, Dong Aihua wrote:
> Hi, Simon: |
