Data loss after network disconnect

Israel Tsadok
A temporary network disconnect of the master node caused a torrent of RELOCATING shards, and then one shard remained UNASSIGNED and the cluster state was left red.

Looking inside the shard's index directory on disk, I found that it was empty (i.e., the _state and translog dirs were there, but the index dir had no files).
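For anyone who wants to run the same check across a whole data node, here is a minimal sketch. It assumes the on-disk layout is <indices root>/<index name>/<shard number>/index, which matches the path in the log message further down; the function names are made up for illustration and are not part of Elasticsearch.

```python
from pathlib import Path


def has_segments_file(shard_dir):
    """Return True if <shard_dir>/index holds at least one segments* file."""
    index_dir = Path(shard_dir) / "index"
    if not index_dir.is_dir():
        return False
    return any(p.is_file() for p in index_dir.glob("segments*"))


def find_empty_shards(indices_root):
    """Yield shard directories whose index dir has no segments* file.

    Assumed layout: <indices_root>/<index name>/<shard number>/index
    """
    root = Path(indices_root)
    for shard_dir in sorted(root.glob("*/*")):
        # Skip non-shard entries such as <index name>/_state.
        if not (shard_dir.is_dir() and shard_dir.name.isdigit()):
            continue
        if not has_segments_file(shard_dir):
            yield shard_dir
```

On the node in question this would presumably be called as find_empty_shards("/home/omgili/data/elasticsearch/data/buzzilla/nodes/0/indices"). It only reads the directory tree and does not touch the cluster.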

Looking at the log files, I see that the disconnect happened around 11:42:05, and a few minutes later I started seeing these error messages:

[2014-09-10 11:45:33,341][WARN ][indices.cluster          ] [buzzilla_data008] [el-2011-10-31-0000][0] failed to start shard
[2014-09-10 11:45:33,342][WARN ][cluster.action.shard     ] [buzzilla_data008] [el-2011-10-31-0000][0] sending failed shard for [el-2011-10-31-0000][0], node[RAR26zfuTiKl4mdbRVTtNA], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] failed to fetch index version after copying it over]; nested: IndexShardGatewayRecoveryException[[el-2011-10-31-0000][0] shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(least_used[rate_limited(mmapfs(/home/omgili/data/elasticsearch/data/buzzilla/nodes/0/indices/el-2011-10-31-0000/0/index), type=MERGE, rate=20.0)]): files: []]; ]]
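To find every shard that hit this failure in a large log file, a small parser along these lines might help. It is a sketch that assumes all relevant lines follow the bracketed layout of the two messages above; the regular expression and the function names are illustrative only.

```python
import re

# Matches: [timestamp][LEVEL ][component   ] [node] [index][shard] message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<component>[^\]]+?)\s*\]\s+"
    r"\[(?P<node>[^\]]+)\]\s+"
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_shard_log_line(line):
    """Return the fields of a shard-level log line as a dict, or None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["shard"] = int(fields["shard"])
    return fields


def shards_that_failed_to_start(lines):
    """Return the sorted set of (index, shard) pairs with a start failure."""
    failed = set()
    for line in lines:
        fields = parse_shard_log_line(line)
        if fields and fields["message"].startswith("failed to start shard"):
            failed.add((fields["index"], fields["shard"]))
    return sorted(failed)
```

Passing an open log file to shards_that_failed_to_start gives the list of affected shards, which can then be compared with what is actually on disk.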

data009 is the original master, data017 is the new master, and data008 is where I found the empty index directory.

I had to delete the unassigned index from the cluster to return to green state.
I am running Elasticsearch 1.2.1 in a 20-node cluster.
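Before deleting anything to get out of a red state, it may be worth listing exactly which indices still have unassigned shards. The sketch below reads the cluster health API with level=indices. The base URL is an assumption (a node listening on localhost:9200), and the helper names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Fetch per-index cluster health. base_url is an assumed address."""
    with urlopen(base_url + "/_cluster/health?level=indices") as resp:
        return json.loads(resp.read().decode("utf-8"))


def indices_with_unassigned_shards(health):
    """Map index name -> unassigned shard count, for counts above zero."""
    result = {}
    for name, info in health.get("indices", {}).items():
        count = info.get("unassigned_shards", 0)
        if count > 0:
            result[name] = count
    return result
```

Calling indices_with_unassigned_shards(fetch_cluster_health()) against a live node returns only the indices that are keeping the cluster from going green.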

How does this happen? What can I do to prevent this from happening again?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CADdQPqz%3DXZDMbHC7zpWgEdaqW4Xy_VkX7EgRwfXsrJjuoQ50SA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: Data loss after network disconnect

Igor Motov-3
How were these nodes doing in terms of available heap space before the disconnects occurred? 
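One way to answer that after the fact, if no monitoring was in place, is to sample heap usage from the nodes stats API. This is a rough sketch: it assumes the response of /_nodes/stats/jvm carries heap_used_in_bytes and heap_max_in_bytes under jvm.mem for each node, and that a node is reachable on localhost:9200. The function names are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for all nodes. base_url is an assumed address."""
    with urlopen(base_url + "/_nodes/stats/jvm") as resp:
        return json.loads(resp.read().decode("utf-8"))


def heap_used_percent_by_node(stats):
    """Map node name -> heap usage as a percentage of the maximum heap."""
    usage = {}
    for node_id, node in stats.get("nodes", {}).items():
        mem = node.get("jvm", {}).get("mem", {})
        used = mem.get("heap_used_in_bytes")
        maximum = mem.get("heap_max_in_bytes")
        if used is None or not maximum:
            continue  # field names are assumptions; skip nodes lacking them
        usage[node.get("name", node_id)] = 100.0 * used / maximum
    return usage
```

This only shows the current state, so it cannot reconstruct heap pressure at the time of the disconnect, but running it periodically would give a baseline for the next incident.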

On Wednesday, September 10, 2014 6:26:19 AM UTC-4, Israel Tsadok wrote:
The relevant log files are at https://gist.github.com/itsadok/97453743d6b211681aca
