Any clues about transport connection issues on AWS HVM instances?
Hi Elasticsearch list :)
I'm having some trouble while running Elasticsearch on r3.large (HVM virtualization) instances in AWS. The short story is that, as soon as I put any significant load on them, some requests take a very long time (for example, Indices Stats) and I see disconnect/timeout errors in the logs. Has anyone else experienced similar things, or does anyone have ideas for a solution other than avoiding HVM instances?
More detailed symptoms:
- if there's very little load on them (say, 2GB of data on each node, a few queries and indexing operations), all is well
- by "significant load", I mean some 10GB of data, a few queries per minute, 100 docs indexed per second (4K per doc, <10 fields). By no means "overload", CPU rarely tops 20%, no significant GC, nothing suspicious in any of the metrics SPM collects. The only clue is that, for the time the problem appears, we get heartbeat alerts because requests to the stats APIs take too long
- by "some requests take very long time", I mean that some queries take miliseconds (as I would expect them), and some take 10 minutes or so. Eventually succeeding (at least this was the case for the manual requests I've sent)
- sometimes nodes get temporarily dropped from the cluster, but then things quickly come back to green. However, sometimes shards get stuck while relocating (the watcher script at the end of this post is what I use to catch the node drops)
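
In case it helps anyone reproduce the stats slowness, this is roughly how I time the Indices Stats calls. Just a sketch: the URL, poll interval, and "slow" threshold are placeholders you'd adjust for your own setup.

    # Polls the indices stats endpoint and flags calls that take too long.
    import time
    import urllib.request

    ES_URL = "http://localhost:9200/_stats"   # indices stats endpoint; host is a placeholder
    POLL_INTERVAL = 10    # seconds between calls
    SLOW_THRESHOLD = 30   # anything slower than this gets flagged

    while True:
        start = time.time()
        try:
            with urllib.request.urlopen(ES_URL, timeout=600) as resp:
                resp.read()
            elapsed = time.time() - start
            label = "SLOW" if elapsed > SLOW_THRESHOLD else "ok"
            print("%s stats call: %.2fs" % (label, elapsed))
        except Exception as exc:
            print("stats call failed after %.1fs: %s" % (time.time() - start, exc))
        time.sleep(POLL_INTERVAL)
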
Things I've tried:
- different ES versions and machine sizes: the same problem appears on 0.90.7 with r3.xlarge instances; I'm currently on 1.1.1 with r3.large
- tore down all the machines, launched new ones, and redeployed. Same thing
- different JVM (1.7) versions: Oracle u25, u45, u55, u60, OpenJDK u51. Same thing everywhere
- spawned the same number of m3.large machines (same specs as r3.large, except half the RAM and paravirtual instead of HVM). The problem magically went away with the same data and load
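
And for the node-drop symptom, this is the kind of watcher I leave running while the load is on. Again just a sketch, assuming the default HTTP port: it prints whenever the node count or cluster status changes.

    # Polls _cluster/health and prints whenever node count or status changes.
    import json
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:9200/_cluster/health"   # host is a placeholder

    last_seen = None
    while True:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=60) as resp:
                health = json.loads(resp.read().decode("utf-8"))
            current = (health["number_of_nodes"], health["status"])
            if current != last_seen:
                print("%s nodes=%d status=%s" % (time.strftime("%H:%M:%S"), current[0], current[1]))
                last_seen = current
        except Exception as exc:
            print("%s health call failed: %s" % (time.strftime("%H:%M:%S"), exc))
        time.sleep(5)
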