but that was a while back and I was wondering if there was more support built for this use case, namely:
- Millions of small indexes distributed over a fleet of machines
- Only a small subset of those indexes is active at any time.
- Indexes unused for a while get evicted automatically to free up space (back then Shay said it would be tricky to implement)
- Having an overhead per Lucene index is still worth it given the above, plus faster search and faster updates per index.
- Each machine can fit several thousand indices in memory using either the mmap directory and file system cache, or the bytebuffer directory that elasticsearch implemented.
Is this now a viable scenario in elasticsearch?
Some additional questions:
- Closing indices was advised to be done via an API call, meaning someone else has to keep track of the LRU logic. Can this be done automatically? Say, the oldest indices get auto-closed when their number grows beyond a certain count on a particular machine (this can potentially lead to thrashing, but then the fleet would need to be big enough to avoid it).
- When opening an index on the fly, is it possible to get a pre-built index from somewhere else and bootstrap it in, as opposed to just giving raw data to ES and having it index on demand? The goal is to make search available asap and not waste CPU cycles on indexing on the ES fleet itself.
Does ES support the above? If not, would it be a lot of work to create modules for that?
The entire guide is a lot to take in at once. Could you please point me to specific places I can read to get a better idea of this use case, and to examples of modules that manage indices, so I can see whether I can build something that suits me if it doesn't already exist?
The main overhead comes from having many Lucene indices (a shard in ES, or a core in Solr).
What you mentioned is pretty much the state of ES. You can close/open indices, but you need to manage it yourself.
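To make "manage it yourself" a bit more concrete, here is a minimal sketch of client-side LRU bookkeeping. The class name `OpenIndexLRU`, the `max_open` limit, and the `es_post` helper are all hypothetical names made up for illustration; the only ES-specific assumption is that indices are opened and closed with `POST /<index>/_open` and `POST /<index>/_close`. It tracks one global list, not a per-machine one, so treat it as a starting point rather than a finished design.

```python
import urllib.request
from collections import OrderedDict
from typing import Callable, List


def es_post(base_url: str, path: str) -> int:
    """Hypothetical helper: send an empty POST to the cluster and return the HTTP status."""
    req = urllib.request.Request(base_url + path, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status


class OpenIndexLRU:
    """Keeps at most `max_open` indices open, closing the least recently used first."""

    def __init__(self, max_open: int,
                 open_index: Callable[[str], None],
                 close_index: Callable[[str], None]) -> None:
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self.max_open = max_open
        self._open_index = open_index
        self._close_index = close_index
        self._open = OrderedDict()  # index name -> True, oldest use first

    def touch(self, name: str) -> None:
        """Call before every search or update against `name`."""
        if name in self._open:
            self._open.move_to_end(name)  # mark as most recently used
            return
        # Make room first, so the open count never exceeds max_open.
        while len(self._open) >= self.max_open:
            victim, _ = self._open.popitem(last=False)
            self._close_index(victim)
        self._open_index(name)
        self._open[name] = True

    def open_indices(self) -> List[str]:
        """Names of currently open indices, least recently used first."""
        return list(self._open)


# Example wiring against a cluster (assumed URL, not executed here):
# base = "http://localhost:9200"
# lru = OpenIndexLRU(1000,
#                    lambda n: es_post(base, "/" + n + "/_open"),
#                    lambda n: es_post(base, "/" + n + "/_close"))
# lru.touch("customer_42")
```

The open and close actions are passed in as callables so the eviction policy can be exercised without a running cluster. As the question already notes, a limit that is too small for the working set will thrash.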
You can store the index somewhere else, but you will need to copy it over to the cluster yourself and call open (and pay the price of copying it over). The snapshot/restore feature in 1.0 will make this easier.
Obviously, the more machines you have, the more indices/shards you can have. In most cases, you can overload a single index with the "many indices" case; check my video at buzz on the video page, where I talk about data design patterns and especially the users design pattern.
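The post does not spell out the users design pattern, so the following is only one possible reading of it, stated as an assumption: every user shares a single physical index, and each user gets an alias that carries a filter and a routing value, so that alias behaves like a small private index. The function name, the `user_id` field, and the `user_<id>` alias naming are hypothetical; the body is shaped for the `_aliases` endpoint.

```python
import json
from typing import Any, Dict


def build_user_alias_request(shared_index: str, user_id: str) -> Dict[str, Any]:
    """Build a request body for POST /_aliases that adds one per-user alias.

    Assumes each document carries a `user_id` field (hypothetical field name).
    """
    return {
        "actions": [
            {
                "add": {
                    "index": shared_index,
                    "alias": "user_" + user_id,
                    # Only this user's documents are visible through the alias.
                    "filter": {"term": {"user_id": user_id}},
                    # Keeps this user's documents together on one shard.
                    "routing": user_id,
                }
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_user_alias_request("users", "42"), indent=2))
```

With something like this, searching `user_42` would touch a single shard of the shared index instead of paying the per-Lucene-index overhead for every user.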