Understanding regexp query better to avoid query failures and OOMs
I have been trying to get my head around how the Regexp Query works in Elasticsearch. To my knowledge, it uses Lucene's regex engine, which is limited. Running a regexp query on a field can be expensive, depending on the number of unique terms in the index for that field. So if a field has the value "brown sugar cake" and the default standard tokenizer is in use, any regular expression in a regexp query on that field will run against the terms brown, sugar and cake, not against the entire string. For this reason, regex in Elasticsearch (and Lucene) becomes expensive. Am I correct?
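To make my mental model concrete, here is a rough sketch in Python. It is only a simulation under my own assumptions, not Lucene's actual engine: `analyze` and `matching_terms` are hypothetical helpers, the tokenizer is a crude stand-in for the standard analyzer, and Python's `re` syntax differs from Lucene's. It uses `fullmatch` because, as far as I understand, Lucene patterns are anchored and must match the whole term.

```python
import re


def analyze(text):
    # Crude stand-in for the standard analyzer (assumption):
    # lowercase the text and split on anything that is not a letter or digit.
    return [term for term in re.split(r"[^0-9a-z]+", text.lower()) if term]


def matching_terms(pattern, terms):
    # fullmatch mimics an anchored pattern: the regex must match the
    # entire term, not just a substring of it.
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


value = "brown sugar cake"
analyzed_terms = analyze(value)   # one term per token
not_analyzed_terms = [value]      # the whole string is a single term

print(matching_terms("su.*", analyzed_terms))
print(matching_terms("brown.*cake", analyzed_terms))
print(matching_terms("brown.*cake", not_analyzed_terms))
```

If this model is right, a pattern that spans several words can only match when the field is not_analyzed, and the cost grows with the number of distinct terms the pattern has to be tested against.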
Assuming that I am correct, I have a further question. If the performance of a regexp query really depends on the number of unique terms in a field, then reducing the number of unique tokens should significantly improve performance. So running regexp queries on not_analyzed fields should help. But that's not really the case, and regexp is still extremely slow. In my case, the field is called URL and it holds URLs including their query parameters. The field is not_analyzed. In most cases a simple regex is fast enough, but if the regex gets slightly complicated, I never get a response from the server. I also noticed on a local ES server that memory usage keeps increasing and eventually I get an OOM exception.
Another thing that is beyond my understanding is which variables the performance of a regexp query depends on. Just to test that, I created a new index with just 1 document. The document looks something like this:
This query took a lot of time. The logs showed the GC kicking in every 3-5 seconds, and finally the query failed with an OOM exception. I have been trying to understand why this query causes an OOM. After the OOM, the ES node becomes unresponsive until the GC is actually able to free up some memory. This is the exact exception I get in the logs: http://pastebin.mozilla.org/6975835.
In the above case, I understand that the regex is not optimized for Elasticsearch's (or rather Lucene's) regex engine. But why would an unoptimized regex require so much memory? I don't quite understand that.
I don't know what's causing this and I really need to understand how Regexp Queries work in Elasticsearch and how they work in Lucene.