Elasticsearch data node does not fail over when data disk fails
In my production environment we are running Elasticsearch 2.3.1 with one dedicated master node and two data nodes (let's call them dnode1 and dnode2), each on a separate server.
The data drive on one of the data nodes (dnode2) went down. Elasticsearch on dnode2 kept running, but the instance could no longer access the drive that held its data directory. The cluster still considered dnode2 available, so the master node kept trying to send new documents to it. Some time later, once we had discovered what happened, we manually stopped the Elasticsearch instance on dnode2, and only then did Elasticsearch fail over to dnode1. Between the drive failure and the moment we stopped Elasticsearch on dnode2, no new documents could be inserted successfully. In addition, many of the shards were not reassigned to the other node (dnode1) and remained in the UNASSIGNED state. I had to manually reroute the shards of those indices to assign them to the working dnode1.
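For reference, here is a rough sketch of the kind of manual reroute described above, not the exact commands that were run. It assumes the Elasticsearch 2.x form of the cluster reroute API (POST /_cluster/reroute with "allocate" commands) and the default column order of GET /_cat/shards (index, shard, prirep, state, ...). The helper names parse_unassigned and build_reroute_body and the index name in the usage note are hypothetical. The code only builds the request body; sending it to the cluster is left out.

```python
import json


def parse_unassigned(cat_shards_text):
    """Return (index, shard) pairs for shards listed as UNASSIGNED.

    Assumes the default `GET /_cat/shards` column order:
    index shard prirep state docs store ip node
    """
    pairs = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Unassigned rows have no docs/store/ip/node columns, so only
        # require the first four fields.
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            pairs.append((fields[0], int(fields[1])))
    return pairs


def build_reroute_body(unassigned, target_node, allow_primary=False):
    """Build the JSON body for `POST /_cluster/reroute` (2.x syntax).

    allow_primary=True forces allocation of a primary shard; if the
    target node holds no copy of that shard, the data in it is lost,
    so it defaults to False here.
    """
    commands = [
        {
            "allocate": {
                "index": index,
                "shard": shard,
                "node": target_node,
                "allow_primary": allow_primary,
            }
        }
        for index, shard in unassigned
    ]
    return {"commands": commands}


def reroute_json(cat_shards_text, target_node, allow_primary=False):
    """Convenience wrapper: _cat/shards text in, reroute JSON string out."""
    body = build_reroute_body(
        parse_unassigned(cat_shards_text), target_node, allow_primary
    )
    return json.dumps(body, indent=2)
```

Calling reroute_json(shards_text, "dnode1") with the text returned by GET /_cat/shards produces a body that can be posted to /_cluster/reroute, for example with curl. Later Elasticsearch versions replaced the "allocate" command with more specific ones, so this form applies only to the 2.x line.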
We do have our data disk set up in a RAID configuration, but we have had some unexpected problems with it, so I want to know whether anyone has come across this issue and what I can do to enable Elasticsearch to handle such scenarios.