Any clues about transport connection issues on AWS HVM instances?

Radu Gheorghe-2
Hi Elasticsearch list :)

I'm having some trouble running Elasticsearch on r3.large (HVM virtualization) instances in AWS. The short story is that, as soon as I put any significant load on them, some requests (for example, Indices Stats) take a very long time and I see disconnected/timeout errors in the logs. Has anyone else experienced something similar, or does anyone have ideas for a solution other than avoiding HVM instances?

More detailed symptoms:
- if there's very little load on them (say, 2GB of data on each node, few queries and indexing operations) all is well
- by "significant load", I mean some 10GB of data, a few queries per minute, 100 docs indexed per second (4K per doc, <10 fields). By no means "overload": CPU rarely tops 20%, no significant GC, nothing suspicious in any of the metrics SPM collects. The only clue is that, while the problem is occurring, we get heartbeat alerts because requests to the stats APIs take too long
- by "some requests take a very long time", I mean that some queries take milliseconds (as I would expect), and some take 10 minutes or so. They do eventually succeed (at least this was the case for the manual requests I sent)
- sometimes, nodes get temporarily dropped from the cluster, but then things quickly come back to green. However, sometimes shards get stuck while relocating
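
For anyone who wants to reproduce the "milliseconds vs. minutes" split, here is a minimal sketch of how the request timings could be collected and partitioned. It is only an illustration, not what SPM does: the base URL `http://localhost:9200`, the 900-second client timeout and the helper names (`time_call`, `split_outliers`, `fetch_indices_stats`) are all assumptions.

```python
import time
from typing import Callable, List, Tuple
from urllib.request import urlopen


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock duration of fn() in milliseconds."""
    start = time.monotonic()
    fn()
    return (time.monotonic() - start) * 1000.0


def split_outliers(samples_ms: List[float],
                   threshold_ms: float) -> Tuple[List[float], List[float]]:
    """Partition latency samples into (normal, slow) around threshold_ms."""
    normal = [s for s in samples_ms if s <= threshold_ms]
    slow = [s for s in samples_ms if s > threshold_ms]
    return normal, slow


def fetch_indices_stats(base_url: str = "http://localhost:9200",
                        timeout_s: float = 900.0) -> bytes:
    """Call the Indices Stats API on one node (base_url is an assumed address)."""
    with urlopen(base_url + "/_stats", timeout=timeout_s) as resp:
        return resp.read()
```

Usage would be something like `samples = [time_call(fetch_indices_stats) for _ in range(20)]` followed by `split_outliers(samples, 1000.0)`, run against each node in turn to see whether the slow requests cluster on particular nodes.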

Things I've tried:
- different ES versions and machine sizes: the same problem seems to appear on 0.90.7 with r3.xlarge instances; I'm on 1.1.1 with r3.large
- tore down all machines, launched new ones and redeployed. Same thing
- different JVM (1.7) versions: Oracle u25, u45, u55, u60, OpenJDK u51. Same thing everywhere
- spawned the same number of m3.large machines (same specs as r3.large, except for half the RAM and paravirtual instead of HVM). The problem magically went away with the same data and load

Here are some Node Disconnected exceptions:
[2014-06-18 13:05:35,058][WARN ][search.action            ] [es01] Failed to send release search context
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][search/freeContext] disconnected
[2014-06-18 13:05:35,058][DEBUG][action.admin.indices.stats] [es01] [83f0223f-4222-4a57-a918-ff424924f002_2014-05-20][1], node[oOlO-iewR3qnAuQkT28vfw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@3339f285]
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][indices/stats/s] disconnected
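
To see which peers and actions are affected most often, exception lines like the ones above can be tallied. This is a rough sketch that assumes the log format shown in this message; the regular expressions and the `count_disconnects` helper are illustrative and not part of Elasticsearch.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the exception line format quoted above, e.g.
# "...NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][search/freeContext] disconnected"
DISCONNECT = re.compile(
    r"NodeDisconnectedException: "
    r"\[(?P<peer>[^\]]+)\]"
    r"\[inet\[(?P<addr>[^\]]*)\]\]"
    r"\[(?P<action>[^\]]+)\] disconnected"
)


def count_disconnects(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count NodeDisconnectedException occurrences per (peer node, action)."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        match = DISCONNECT.search(line)
        if match:
            counts[(match.group("peer"), match.group("action"))] += 1
    return counts
```

Feeding it the log file line by line (for example `count_disconnects(open(path))`, with `path` pointing at the node's log) gives a per-node, per-action breakdown that can be compared across the cluster.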

I've enabled TRACE logging on both transport and discovery and all I see is connection timeouts and exceptions, like:

07:29:19,039][TRACE][transport.netty ] [es01] close connection exception caught on transport layer [[id: 0x190d8444]], disconnecting from relevant node

Or, more verbose:

[2014-06-16 07:29:19,060][TRACE][transport.netty          ] [es01] connect exception caught on transport layer [[id: 0x6816c0fe]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2014-06-16 07:29:19,060][TRACE][discovery.zen.ping.unicast] [es01] [1] failed to connect to [#zen_unicast_7#][es01][inet[es04/10.79.155.249:9300]]
org.elasticsearch.transport.ConnectTransportException: [][inet[es04/10.79.155.249:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:683)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:643)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:610)
    at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:133)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:279)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    ... 3 more
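
Since the traces show plain TCP connect timeouts on port 9300, one way to check whether the problem is below Elasticsearch would be to time raw TCP connects between the nodes, outside the JVM. A minimal sketch follows; the `connect_time_ms` helper is hypothetical, and the 30-second default only mirrors the `connect_timeout[30s]` in the trace above.

```python
import socket
import time
from typing import Optional


def connect_time_ms(host: str, port: int,
                    timeout_s: float = 30.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the
    attempt fails or times out (socket.timeout is an OSError)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None
```

Running something like `connect_time_ms("es03", 9300)` in a loop from es01 while the cluster is under load would show whether connects stall at the network level too, or only inside Elasticsearch.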

I'd appreciate any information, pointers, or intuition you may have!

Thanks and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
