One primary shard is "lost" permanently when updating data

Jingzhao Ou
Hi, all, 

I have an urgent case and would appreciate any help. I have an index with 5 shards and no replicas, which has been running fine on a single AWS EC2 c3.large box for months. I now need to update around 1 million entries in this index. I use bulk operations to do the update in batches of 50K. After about 300K updates, my bulk operations started to time out and fail. I then checked the index status and got:

  "_shards" : {
    "total" : 5,
    "successful" : 4,
    "failed" : 0
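
For reference, the numbers in this header can be read mechanically. Below is a minimal Python sketch, assuming that a shard which is not allocated is counted as neither successful nor failed (which would explain "failed" : 0 here). The helper name `unaccounted_shards` is made up for illustration and is not part of any Elasticsearch client.

```python
def unaccounted_shards(shards):
    """Return how many shards reported neither success nor failure.

    `shards` is the "_shards" object from a response, as a dict.
    Assumption: a shard that is not allocated appears in "total" only.
    """
    return shards["total"] - shards["successful"] - shards["failed"]


# The header from the status output above.
header = {"total": 5, "successful": 4, "failed": 0}
print(unaccounted_shards(header))  # 1
```

A non-zero result with "failed" at 0 matches the symptom described: one shard did not answer at all, rather than answering with an error.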

One shard is lost permanently. I can still query for data, but any index operation I run afterwards times out every few tries. The only way I have found to fix this is to wipe out the whole index and restore it from snapshots. I tested Elasticsearch 1.3.5 and 1.4.1, and both show this symptom. I tried pausing for 10 seconds between bulk updates and setting refresh_interval to -1. Neither helped.
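
To make the update procedure concrete, here is a rough Python sketch of batched bulk updates with a pause between batches. It only builds the newline-delimited request bodies in the format the `_bulk` endpoint expects; the actual HTTP call is left to a caller-supplied `send` function. The function names, the `send` callable, and the index and type names in the usage note are hypothetical placeholders, not the code actually used for this index.

```python
import json
import time


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_update_body(index, doc_type, updates):
    """Build a newline-delimited body for the _bulk endpoint.

    `updates` is a list of (doc_id, partial_doc) pairs. Each pair becomes
    an "update" action line followed by a {"doc": ...} line.
    """
    lines = []
    for doc_id, partial_doc in updates:
        action = {"update": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"doc": partial_doc}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def run_in_batches(index, doc_type, updates, batch_size, send, pause_seconds=0.0):
    """Send `updates` batch by batch through `send`, pausing after each batch.

    `send` is a hypothetical callable that would POST the body to /_bulk.
    Returns the number of update entries handed to `send`.
    """
    sent = 0
    for batch in chunked(updates, batch_size):
        send(bulk_update_body(index, doc_type, batch))
        sent += len(batch)
        time.sleep(pause_seconds)
    return sent
```

With the numbers from this post, that would be called roughly as `run_in_batches("my_index", "my_type", updates, 50000, send, pause_seconds=10)`, where `send` does the HTTP request.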

Strangely, I ran the same operation on my Windows 8 machine and it worked just fine there. I am not sure why it fails so badly on AWS. My data is stored on a 100 GB EBS volume. Can anyone give me some help? I am really worried about data loss at this point.

Thanks a lot!

