Multiple cluster state copies in memory VS many aliases
I'm not sure if you could do anything about the issue, but still, I wanted to make sure you are aware of it. So, here is the story:
I work for Swiftype.com. Due to the nature of our business, we implement user search engines as separate or aggregated indexes in ES and use per-customer aliases to route requests to the appropriate place. As a result, we end up with clusters that have ~1000 shards and could easily have 50-150k aliases. ES holds up really well under load, and the only issue we have, which has been the major pain point for us for the last 4 months, is the way ES handles cluster state changes.
Every time we add/remove an alias or an index, ES generates gigabytes of garbage in the Java heap. Adding more heap does not really help, because those piles of garbage then start causing very long old gen pauses. That is really painful for us, since our clusters are constantly under load and having them stop for seconds to collect garbage is unacceptable.
For months we've been building all kinds of bandages to mitigate the effects of this issue (like creating an external per-cluster locking system to avoid concurrent cluster state changes and other crazy stuff like that) and trying to figure out why it was happening, without much luck.
Yesterday we added a master-only (no data) node to one of our small clusters and gave it 4G of heap (with new ratio = 3). Even during the process of joining the cluster it went to GC for 5-10 sec multiple times, every time failing to join because of timeouts. After many retries it'd finally join and old gen would keep steady at ~1.2Gb (out of 3Gb we gave it) for hours and hours. But then, when some users would start creating/changing their indexes too frequently (once every few seconds), the server would just go crazy and end up in a Full GC thrashing mode.
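For reference, the 3Gb old gen figure follows from the heap settings. This is a rough sketch, assuming "new ratio" means the JVM `-XX:NewRatio` option (old gen to young gen ratio) and ignoring any rounding the JVM applies; `split_heap` is just an illustrative helper.

```python
def split_heap(heap_mb, new_ratio):
    """Return (young_mb, old_mb) for a heap sized with -XX:NewRatio=new_ratio."""
    # NewRatio is old/young, so young gen gets 1 part out of (new_ratio + 1).
    young_mb = heap_mb / (new_ratio + 1)
    return young_mb, heap_mb - young_mb


young_mb, old_mb = split_heap(4096, 3)
print(young_mb, old_mb)  # 1024.0 3072.0
```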
I took a Java heap dump from this server today and, since there is no Lucene crap in it (as there usually is on larger instances with data), it was REALLY obvious where the problem is: in my dump I see that our cluster MetaData object is ~681Mb in size and we have 6 (!) copies of it live in memory:
* one copy in current cluster metadata
* one copy in a running InternalClusterService$UpdateTask
* and 4 copies in InternalClusterService$UpdateTask instances in the blocking queue for the update service.
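A back-of-the-envelope calculation shows why this is fatal on this node. It is only an upper bound: it assumes the six copies share no structure with each other, which the heap dump numbers alone do not prove, and it takes 1Gb as 1024Mb.

```python
METADATA_MB = 681      # size of one MetaData object, from the heap dump
OLD_GEN_MB = 3 * 1024  # old gen on this node, taking 1Gb = 1024Mb


def copies_footprint_mb(current=1, running=1, queued=4, size_mb=METADATA_MB):
    """Worst-case memory pinned by live copies of the cluster metadata."""
    return (current + running + queued) * size_mb


total_mb = copies_footprint_mb()
print(total_mb)               # 4086
print(total_mb > OLD_GEN_MB)  # True
```

Under that assumption, every additional update task waiting in the queue pins roughly another 681Mb until it has run.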
I understand that we could add many more gigabytes of RAM and that would help us to handle such "surges" of cluster state updates, but I'm afraid that would just cost us a lot in GC pauses and would not really solve the issues we're having on data nodes anyways.
So, I was wondering if you have any suggestions on how we can mitigate this issue in the short term (aside from not creating so many aliases, which is not really possible for us at the moment) and if you have any ideas on how it could be solved in the server code in the long run.
I'd really appreciate any help!
P.S. We're on 0.90.5. If there are any fixes around cluster state handling in the newer versions, please let me know. We're going to be upgrading our clusters very soon and if there is a chance a new release would help with GC issues, I'd bump up the priority of the upgrade task in our task list.