Cluster Health degraded overnight with no apparent reason


Cluster Health degraded overnight with no apparent reason

Curt Kohler

This morning I arrived to find our test ES cluster in poor health...  

Some background on our cluster...

Single ES index running on a single AWS High I/O instance (60 GB RAM, 2 TB SSD storage, 16 virtual cores). The index has 8 shards, contains approximately 12 million documents, and is approximately 450 GB in size. We're running Elasticsearch 0.20.5 on OpenJDK 1.6.0_24 (I know I'd be better off with Oracle 1.7; I just haven't had the time to update our install scripts).
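
For reference, the shard/document/size figures above can be confirmed with the status APIs that ship with 0.20.x; a minimal sketch, assuming the node listens on localhost:9200 (the index name sdxcr is taken from the logs below):

$ curl 'localhost:9200/_cluster/health?pretty=true'
$ curl 'localhost:9200/sdxcr/_status?pretty=true'      # per-shard doc counts and store size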

Yesterday afternoon we ran some load tests against the index; they completed successfully, and the cluster was in good health when I left for the day. When I returned this morning, I was unable to run any searches, and after some digging I saw that the cluster was having problems. At about the time of my first search of the morning, ES started writing log entries for the index as if it had just restarted. Note that there were no log entries indicating problems overnight after the load test ended. This morning, ES appeared to be thrashing over the various shards trying to recover them. Here is the start of today's log file:

[2013-05-07 07:37:43,798][INFO ][node                     ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: initializing ...
[2013-05-07 07:37:43,906][INFO ][plugins                  ] [ip-10-148-148-93.ec2.internal] loaded [cloud-aws], sites [bigdesk, head, paramedic]
[2013-05-07 07:37:47,209][INFO ][node                     ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: initialized
[2013-05-07 07:37:47,209][INFO ][node                     ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: starting ...
[2013-05-07 07:37:47,612][INFO ][transport                ] [ip-10-148-148-93.ec2.internal] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/X.X.X.X:9300]}
[2013-05-07 07:37:51,739][INFO ][cluster.service          ] [ip-10-148-148-93.ec2.internal] new_master [ip-10-148-148-93.ec2.internal][t5V-xrEnTp2FtCJxBJ5gPg][inet[/X.X.X.X:9300]]{max_local_storage_nodes=1}, reason: zen-disco-join (elected_as_master)
[2013-05-07 07:37:51,774][INFO ][discovery                ] [ip-10-148-148-93.ec2.internal] sdxcr/t5V-xrEnTp2FtCJxBJ5gPg
[2013-05-07 07:37:51,837][INFO ][http                     ] [ip-10-148-148-93.ec2.internal] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/X.X.X.X:9200]}
[2013-05-07 07:37:51,838][INFO ][node                     ] [ip-10-148-148-93.ec2.internal] {0.20.5}[2261]: started
[2013-05-07 07:37:51,905][DEBUG][action.admin.indices.status] [ip-X.X.X.X.ec2.internal] [sdxcr][0], node[t5V-xrEnTp2FtCJxBJ5gPg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@305f387c]
org.elasticsearch.indices.IndexMissingException: [sdxcr] missing
        at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
[2013-05-07 07:37:51,930][DEBUG][action.admin.indices.status] [ip-X.X.X.X.ec2.internal] [sdxcr][2], node[t5V-xrEnTp2FtCJxBJ5gPg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@305f387c]
org.elasticsearch.indices.IndexMissingException: [sdxcr] missing
        at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)

...

eventually leading to repeating errors like these for the various shards:


[2013-05-07 07:37:52,958][DEBUG][action.admin.indices.status] [ip-X.X.X.X.ec2.internal] [sdxcr][7], node[t5V-xrEnTp2FtCJxBJ5gPg], [P], s[INITIALIZING]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@6446154e]
org.elasticsearch.index.IndexShardMissingException: [sdxcr][7] missing
        at org.elasticsearch.index.service.InternalIndexService.shardSafe(InternalIndexService.java:179)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:153)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:234)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:211)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$1.run(TransportBroadcastOperationAction.java:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
[2013-05-07 07:37:53,021][WARN ][indices.cluster          ] [ip-X.X.X.X.ec2.internal] [sdxcr][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [sdxcr][2] failed recovery
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:228)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [sdxcr][2] failed to create engine
        at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:252)
        at org.elasticsearch.index.shard.service.InternalIndexShard.start(InternalIndexShard.java:279)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:182)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
        ... 3 more
Caused by: java.io.FileNotFoundException: _5d.fnm
        at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:519)
        at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:71)
        at org.apache.lucene.index.IndexWriter.getFieldInfos(IndexWriter.java:1212)
        at org.apache.lucene.index.IndexWriter.getCurrentFieldInfos(IndexWriter.java:1228)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1161)
        at org.apache.lucene.index.XIndexWriter.<init>(XIndexWriter.java:17)
        at org.elasticsearch.index.engine.robin.RobinEngine.createWriter(RobinEngine.java:1365)
        at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:250)
        ... 6 more

This just goes on and on: it fails to recover one shard and then tries to recover a different one.
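
Since the root cause in that trace is a plain FileNotFoundException for _5d.fnm, one quick check is whether the shard's segments file still references segment _5d while its files are gone from disk. A rough sketch, assuming the default data layout and cluster name (adjust the paths to your install):

$ SHARD=/var/lib/elasticsearch/elasticsearch/nodes/0/indices/sdxcr/2/index   # path.data/cluster_name/... assumed
$ ls -l "$SHARD" | grep '_5d'        # any _5d.* files left on disk?
$ ls -l "$SHARD"/segments_*          # the segments file that recovery is trying to open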

I was curious if anyone else had experienced a similar cluster failure and had any suggestions on how to prevent it in the future.

Thanks for any insights

Re: Cluster Health degraded overnight with no apparent reason

Drew Raines-2
Curt Kohler wrote:

> This morning I arrived to find our test ES cluster in poor
> health...  

[...]

> I was curious if anyone else had experienced a similar cluster
> failure and
> had any suggestions on how to prevent it in the future.

These kinds of problems are symptomatic of running out of disk
space or possibly experiencing OutOfMemoryExceptions.  Do you have
any monitoring on the cluster that can provide any more clues as
to what happened during the night?
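
Even without proper monitoring, a few ad-hoc checks on the box can usually narrow it down to one of those two. A minimal sketch, assuming default data and log locations:

$ df -h /var/lib/elasticsearch                                # is the data volume actually full?
$ grep -c 'OutOfMemoryError' /var/log/elasticsearch/*.log     # any heap exhaustion in the ES logs?
$ dmesg | grep -i 'oom'                                       # did the kernel OOM killer hit the JVM?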

Drew

Re: Cluster Health degraded overnight with no apparent reason

Curt Kohler
Unfortunately nothing beyond bigdesk/head, and nothing at the machine
level... The cluster is just a "play around with" instance to evaluate
the Amazon High I/O instance types with Elasticsearch.

I saw a number of postings pointing to the number of open files, memory,
or disk space as potential sources of similar exceptions. I'm not sure
this would fall under those categories, as the problem occurred at some
point after we ran our 4-hour load test (which had no problems and
didn't run out of memory). After that, the cluster would have been
basically idle until I came back in the morning. The machine still has
over 1 TB of disk space available under the data directories, so that is
probably not the problem either. I need to go back and check, but I'm
pretty sure the file descriptors are set to unlimited for the process
when we start it up.
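
One quick way to confirm that against the running process on Linux (the pgrep pattern is an assumption about how the node was started):

$ ES_PID=$(pgrep -f org.elasticsearch.bootstrap.ElasticSearch | head -1)
$ grep 'open files' /proc/$ES_PID/limits    # configured soft/hard limits
$ ls /proc/$ES_PID/fd | wc -l               # descriptors currently in use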

Thanks for the suggestions...

Curt

On May 7, 10:22 am, Drew Raines <[hidden email]> wrote:

> Curt Kohler wrote:
> > This morning I arrived to find our test ES cluster in poor
> > health...
>
> [...]
>
> > I was curious if anyone else had experienced a similar cluster
> > failure and
> > had any suggestions on how to prevent it in the future.
>
> These kinds of problems are symptomatic of running out of disk
> space or possibly experiencing OutOfMemoryExceptions.  Do you have
> any monitoring on the cluster that can provide any more clues as
> to what happened during the night?
>
> Drew

Re: Cluster Health degraded overnight with no apparent reason

InquiringMind
In reply to this post by Curt Kohler
Curt,

OpenJDK? Not real Sun/Oracle Java?

For reasons beyond my own control, I've run ES 0.20.4 and now 0.90.0 on Sun Java 6, not 7. But it's real Java, not OpenJDK.

On a Solaris 3-node cluster:

-bash-3.00$ java -version
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) Client VM (build 16.0-b13, mixed mode, sharing)

And on my Mac:

$ java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-10M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)

I haven't had any issues with Oracle's Java 6 (above).


On Tuesday, May 7, 2013 9:28:14 AM UTC-4, Curt Kohler wrote:

> This morning I arrived to find our test ES cluster in poor health...
>
> [...]

Re: Cluster Health degraded overnight with no apparent reason

Otis Gospodnetic
In reply to this post by Drew Raines-2
Curt,

It won't help you now, but it will in the future - like Drew said, put some monitoring with historical trending on that ES cluster.  Try SPM for ES (see signature); we use it for ES and a bunch of other things.  Coincidentally, we're doing some ES stress testing now and using it to understand what's happening.  There is a little 'share' icon next to each graph, so you can share any graphs you want people here to see.

Something went wrong at the Lucene level here, judging from the last exception in your paste.  Was that an indexing stress test or a search stress test?
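
If you want to see exactly what Lucene thinks is wrong with that shard, you can run CheckIndex against the shard directory with the node stopped (the paths below are assumptions for a default install; 0.20.5 ships a Lucene 3.6.x core jar in its lib directory):

$ java -cp /path/to/elasticsearch/lib/lucene-core-*.jar \
       org.apache.lucene.index.CheckIndex \
       /var/lib/elasticsearch/elasticsearch/nodes/0/indices/sdxcr/2/index

Adding -fix will drop any unrecoverable segments (and lose their documents), so only do that on a copy or after a backup.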

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Tuesday, May 7, 2013 10:22:49 AM UTC-4, Drew Raines wrote:
Curt Kohler wrote:

> This morning I arrived to find our test ES cluster in poor
> health...  

[...]

> I was curious if anyone else had experienced a similar cluster
> failure and
> had any suggestions on how to prevent it in the future.

These kinds of problems are symptomatic of running out of disk
space or possibly experiencing OutOfMemoryExceptions.  Do you have
any monitoring on the cluster that can provide any more clues as
to what happened during the night?

Drew
