Very weird ES Cluster state problem!


Very weird ES Cluster state problem!

Amit
Hi All,

I have an ES cluster with 2 nodes. I am not sure what caused this issue:

Node 2-

[2] received shard failed for [TestDocTestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2013-02-01 16:37:41,667][WARN ][indices.cluster          ] [2] [TestDoc][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:108)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
[2013-02-01 16:37:41,834][WARN ][cluster.action.shard     ] [2] sending failed shard for [TestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2013-02-01 16:37:41,834][WARN ][cluster.action.shard     ] [2] received shard failed for [TestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2013-02-01 16:39:00,306][WARN ][discovery.zen            ] [2] master should not receive new cluster state from [[1][IlJPr1CBTmKxSgHyHJ7brg][inet[/10.190.209.134:9300]]]


Node 1-

[2013-02-01 10:08:03,861][DEBUG][action.search.type       ] [1] failed to reduce search
org.elasticsearch.action.search.ReduceSearchPhaseException: Failed to execute phase [fetch], [reduce] 
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:177)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchQueryThenFetchAction.java:155)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchQueryThenFetchAction.java:1)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:345)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.executeFetch(TransportSearchQueryThenFetchAction.java:149)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2.run(TransportSearchQueryThenFetchAction.java:136)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassCastException: org.elasticsearch.search.facet.termsstats.longs.InternalTermsStatsLongFacet cannot be cast to org.elasticsearch.plugin.multifssearch.InternalTermsStatsStringFacetMulti
at org.elasticsearch.plugin.multifssearch.InternalTermsStatsStringFacetMulti.reduce(InternalTermsStatsStringFacetMulti.java:490)
at org.elasticsearch.plugin.multifssearch.TermsStatsFacetProcessorMulti.reduce(TermsStatsFacetProcessorMulti.java:166)
at org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:296)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.innerFinishHim(TransportSearchQueryThenFetchAction.java:190)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:175)
... 8 more
[2013-02-01 12:59:42,516][INFO ][cluster.metadata         ] [1] [TestDoc2] creating index, cause [auto(bulk api)], shards [5]/[0], mappings [TestDoc2~type1]
[2013-02-01 13:00:34,555][INFO ][cluster.metadata         ] [1] [TestDoc3] creating index, cause [auto(bulk api)], shards [5]/[0], mappings [TestDoc3~type1]



The cluster health API response from node 1:

{
  "cluster_name" : "test1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4256,
  "active_shards" : 4256,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4209
}

The cluster health API response from node 2:
{
  "cluster_name" : "test",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 8471,
  "active_shards" : 8471,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 0
}
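(For reference, output like the above comes from the cluster health API; a minimal way to fetch it, assuming the default HTTP port 9200 on each node, is:

curl -s 'http://localhost:9200/_cluster/health?pretty=true'

Running it against each node separately is what exposes the two diverging views of the cluster.)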


I looked through the ES group but could not find the exact issue.
It looks like one of the nodes (the primary) left the cluster because of a network issue (I am not sure what the actual issue was; I am assuming it was network-related). The secondary then got elected as master. When the network issue was resolved, the primary node tried to rejoin the cluster, which did happen, but probably the state was not synced? Or there are two master nodes: master1 has two nodes in its cluster but is not able to communicate with the data node, and master2 has only one node in its cluster.

Please help me, as this is going over my head. I looked through the different threads, but found nothing concrete.

Thanks in advance
Amit


Re: Very weird ES Cluster state problem!

joergprante@gmail.com
Yes, it looks like a crash or a split. The problematic indexes are lost and should be deleted; otherwise, ES is not able to resolve the conflict. Note that there are precautions against such node splits: did you change the default settings in zen discovery, minimum_master_nodes for example?
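(A minimal sketch of what deleting a problematic index looks like, assuming the default HTTP port 9200 and using the index name from the logs above; only do this if the data can be re-indexed or restored from elsewhere:

curl -XDELETE 'http://localhost:9200/TestDoc'

The delete index API removes the index and all of its shards from the cluster state.)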

Best regards,

Jörg


Re: Very weird ES Cluster state problem!

kimchy
Administrator
Jorg mentioned the important minimum_master_nodes setting; which version of ES are you on?


Re: Very weird ES Cluster state problem!

Amit
Thanks Jorg and kimchy,

I am using ES version 0.19.4, and the minimum_master_nodes setting is at its default.
Since I have only two nodes in the cluster, I did not change minimum_master_nodes, as N/2+1 gives me a value of 1 for two nodes.
Further, when I restarted the cluster I am still getting the same error!

Please suggest.

Thanks
Amit


Re: Very weird ES Cluster state problem!

Igor Motov-3
Are these the actual responses, or did you change the cluster names before posting them to the mailing list?

The cluster health API response from node 1:

{
  "cluster_name" : "test1",
  "status" : "red",


The cluster health API response from node 2:
{
  "cluster_name" : "test",
  "status" : "red",



Re: Very weird ES Cluster state problem!

Radu Gheorghe-2
In reply to this post by Amit

Hi,

For two nodes you should have minimum_master_nodes=2, otherwise you can end up with two 1-node clusters. And 2/2+1 = 2.
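(A minimal sketch of the setting, assuming it goes into elasticsearch.yml on both nodes and that each node is restarted afterwards:

# elasticsearch.yml on every node of the two-node cluster
discovery.zen.minimum_master_nodes: 2

With this in place, a node that cannot reach the other will refuse to elect itself master instead of forming its own one-node cluster.)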


Re: Very weird ES Cluster state problem!

Amit
Hi Radu,

Thanks for the response.

As per the suggestion, I have changed "minimum_master_nodes" to 2 in the ES config. Since it was also time to add a new node to the cluster, I added one more node (3 nodes in total) and kept "minimum_master_nodes" at 2.

All my nodes have the same configuration:
Java max heap: 12g; memory available on the system: 16g (AWS XLarge)
Data drives to hold the data: 4 drives, 500gb each.

1- When I restarted the cluster, a lot of shards got relocated to the new node, and once the cluster health became stable, the new node (say node3) was holding anywhere between 45-50% of the total data. I assumed the data distribution among the nodes would be uniform?

2- When the system is creating new indexes, the average load on the new node3 goes very high and is constantly between 100-200. When I compare the size of the data across all the nodes, 85-90% of the new index data goes to the new node3?

3- Will changing the Lucene merge settings help, for example setting index.merge.policy.max_merge_at_once or index.merge.policy.min_merge_size to a higher value?

Please help me understand the above.
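(One way to see where the shards actually are, assuming the default HTTP port 9200, is the cluster state API:

curl -s 'http://localhost:9200/_cluster/state?pretty=true'

The routing_table and routing_nodes sections of the response list every shard of every index together with the node it is assigned to.)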

I read through this post:
https://github.com/blog/1397-recent-code-search-outages

I am using ES version 0.19.4 and Oracle Java 1.6.0_31. Is this the best combination, or do I need to move to Java 7 or another Java version? I cannot change the ES version, as I have dependencies around it.

When I add a new node to a live cluster, what is the expected behavior of ES? Since ES will be busy relocating shards, what will the impact be on indexing new data and on searching the existing data?

My apologies for the lengthy post; I did not expect it to be this long!

Thanks
Amit





Re: Very weird ES Cluster state problem!

Radu Gheorghe-2
Hello Amit,

Normally, Elasticsearch tries to balance the number of shards across nodes. It doesn't look at how much data is in these shards or which index the shards belong to.

That might explain your situation, but I'm not sure. If it doesn't make sense to you, please say some more about your index setup. Stuff like how many indices you have, how many shards per index, which kind of documents go in which index and what's the size of each shard.

The good news is you can configure Elasticsearch to allocate shards in various ways. Take a look at these links:
http://www.elasticsearch.org/guide/reference/index-modules/allocation.html
http://www.elasticsearch.org/guide/reference/modules/cluster.html
http://www.elasticsearch.org/guide/reference/api/admin-cluster-reroute.html

Although I think the last one is not available in 0.19.4.
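(A hypothetical sketch of the kind of settings those pages describe; whether each one is available in 0.19.4 should be checked against the docs, and the values below are illustrative only:

# elasticsearch.yml: throttle how many shards rebalance/recover at once
cluster.routing.allocation.cluster_concurrent_rebalance: 2
cluster.routing.allocation.node_concurrent_recoveries: 2

# per-index setting: cap how many shards of a single index may sit on one node
index.routing.allocation.total_shards_per_node: 2

Capping total_shards_per_node for new indices is one way to keep a single node from receiving most of the newly indexed data.)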

Best regards,
Radu
--
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Wed, Feb 13, 2013 at 10:56 AM, Amit Singh <[hidden email]> wrote:
Hi Radu,

Thanks for the response.

As per the suggestion I have made the changes in the ES config "minimum_master_nodes" to 2. And since it was time to add a new node in cluster. I added one more node in cluster ( total 3 nodes) and "minimum_master_nodes" to 2.

All my nodes have some configuration.
Java max heap- 12g ; memory available on system -16g (AWS - XLarge)
Data drives to hold the data- 4 drives - 500gb each.

1-When I restarted the cluster, lot of shards got relocated to the new node and when the cluster health became stable. The new node ( say node3 ) was holding data anywhere between 45-50 % of the total data. I assumed the data distribution among the nodes to be uniform?

2- When the system is creating new Indexes. The average load for the new node3 is going very high and is constantly between 100-200. When I compare the size of the data across all the node, 85-90 % of the new index data goes to the new node3?

3- Change in the lucene merge setting will help or not. like index.merge.policy.max_merge_at_once setting this to a higher value or index.merge.policy.min_merge_size to a higher value.

Please help me understand the above.

I read through the post;

I am using ES 0.19.4 version and oracle java 1.6.0_31. Is this the best combination or do I need to change the java version to 7 or any other version of java. I cannot change the es vesrion, as i have dependency around it.

When I add a new node to a live cluster, what is expected behavior of ES. I mean since the es will be busy relocating the shards. What will the impact on indexing new data and performing search on the existing data.

My apologies for the lengthy post; I did not expect it to be this long!

Thanks
Amit




On Tuesday, February 5, 2013 12:56:20 AM UTC+5:30, Radu Gheorghe wrote:

Hi,

For two nodes you should have minimum_master_nodes=2, otherwise you can end up with two 1-node clusters. And 2/2+1=2.
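
So on both nodes you would put this in elasticsearch.yml and restart:

  discovery.zen.minimum_master_nodes: 2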

On Feb 3, 2013 7:09 AM, "Amit Singh" <[hidden email]> wrote:
Thanks Jorg and kimchy,

I am using ES version 0.19.4 and the minimum_master_nodes setting is at its default.
Since I have only two nodes in the cluster, I did not change minimum_master_nodes, as N/2+1 would give me a value of 1 for two nodes.
Further, when I restarted the cluster I am still getting the same error!

Please suggest.

Thanks
Amit

On Sunday, February 3, 2013 3:42:02 AM UTC+5:30, kimchy wrote:
Jörg mentioned the important minimum_master_nodes setting; which version of ES are you on?

On Feb 1, 2013, at 11:18 PM, Jörg Prante <[hidden email]> wrote:

> Yes, it looks like a crash or a split. The problematic indexes are lost and should be deleted; otherwise, ES is not able to resolve the conflict. Note that there are precautions against such node splits: did you change the default settings in zen discovery, minimum_master_nodes for example?
>
> Best regards,
>
> Jörg