How to sort on a small cluster with 100m+ documents without OOME
I have to query my data and sort it by some field (datetime, for example).
If I query the indexes with various criteria there is no problem until sorting is added. I expected to get the filtered data back sorted (~1k matches out of 100m+), but it seems the engine tries to load all the field values into memory, which causes an OOME.
So the question is: is there a way to sort only the filtered data? (I do not store the documents' source.)
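Outside any particular engine, the behaviour being asked for can be sketched in a few lines. This is only an illustration of the idea (filter first, then fetch sort keys for just the matched docs); the `docs` dictionary and `search_then_sort` function are made up, and a real index obviously does not work on an in-memory dict:

```python
import heapq

# Hypothetical stand-in for an index: doc id -> (text, timestamp).
# In a real engine the sort-field values live in a per-field cache,
# which is what blows up when loaded for all 100m+ docs at once.
docs = {
    1: ("error in module a", 1000),
    2: ("all quiet", 2000),
    3: ("error in module b", 1500),
    4: ("startup complete", 500),
}

def search_then_sort(query_word, limit=10):
    # Step 1: filter -- collect only the matching doc ids (~1k of 100m+).
    matches = [doc_id for doc_id, (text, _) in docs.items() if query_word in text]
    # Step 2: look up the sort key only for the matched docs, never for
    # the whole field, and keep just the top `limit` entries.
    return heapq.nsmallest(limit, matches, key=lambda d: docs[d][1])

print(search_then_sort("error"))  # -> [1, 3]
```

The catch, as the reply below this explains, is step 2: looking up per-document sort keys without a field-wide cache means random disk access per matched doc.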
Re: How to sort on small cluster with 100m+ documents without OOME
I'm somewhat pessimistic about sorting, from both a technical and a user-experience standpoint:
* Technically: if you're not caching masses of values in RAM (your current problem), you have to hit disk randomly to retrieve those values for a large number of matching docs, which is slow.
* From a user-experience point of view: a lot of what Lucene has to offer is relevance-ranked ordering of partial matches, i.e. it is designed to produce fuzzy sets of results where, unlike in databases, membership of the set is a matter of degree. Sorting a fuzzy set (or, for that matter, producing facet counts) only helps surface the low-quality matches that may otherwise be lurking unseen in the long tail of that very large fuzzy set.
Sometimes you can solve the user's problem a different way, e.g. if you are sorting most-recent-first because you want to let a user see current content, consider the Google search approach instead: offer users a range of filters ("past hour", "past week", etc.).
Filters like these are technically more efficient to implement, and users benefit from the natural relevance-ranking order keeping the highest-quality matches first and the long tail of low-quality crud out of sight.