Filter aggregation and nested documents

Filter aggregation and nested documents

Olivier B
Hi all,

I'm working with nested documents (millions of them) and running aggregations on them. Naturally, I need to use a filter aggregation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filter-aggregation.html), but this does not seem to work with nested documents:

{
   "aggs": {
      "items": {
         "nested": {
            "path": "items"
         },
         "filter": {
            "ids": {
               "values": [
                  "2AA4CE67-9469-4AE7-AC99-46F7E2646C2F"
               ]
            }
         },
         "aggs": {
            "questions": {
               "terms": {
                  "field": "items.question_label.raw",
                  "size": 0
               }
            }
         }
      }
   }
}

Response: 
Parse Failure [Found two aggregation type definitions in [items]: [nested] and [filter]. Only one type is allowed.]]; }]

So I tried another way:
{
   "query": {
      "filtered": {
         "filter": {
            "ids": {
               "values": [
                  "2AA4CE67-9469-4AE7-AC99-46F7E2646C2F"
               ]
            }
         }
      }
   },
   "aggs": {
      "items": {
         "nested": {
            "path": "items"
         },
         "aggs": {
            "questions": {
               "terms": {
                  "field": "items.question_label.raw",
                  "size": 0
               }
            }
         }
      }
   }
}

This works, but:
- it takes several seconds,
- the cache fills up very quickly,
- because the cache is full, new queries are refused (I'm using ES 1.1.1 with the circuit breaker).
Of course, this is not acceptable for production.

So basically, I have millions of documents, but in my example I aggregate within a single document containing around 100 nested documents with 10 fields, and it takes 2 GB of memory for the data cache and several seconds to run.
My guess is that the filter is not helping much: the aggregation seems to run over all documents before filtering, rather than the other way around as I expected.

Is there any better solution for filter aggregation with nested documents?

Many thanks!
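
On the parse failure in the first request: the error says an aggregation node may carry only one type, so "nested" and "filter" cannot be siblings under "items". A minimal sketch of one way around it, assuming the ids filter is meant to select the root document, is to make the filter aggregation the parent and put the nested aggregation inside it. The aggregation name "one_doc" and the helper functions are made up for illustration, and the body is built in Python only so its shape can be checked. This addresses the parse error only; it does not by itself change how much fielddata is loaded.

```python
import json


def nested_terms_for_one_doc(doc_id, path="items",
                             field="items.question_label.raw"):
    """Build a request body in which every aggregation node has one type.

    filter (root documents) -> nested (step into `path`) -> terms.
    """
    return {
        "aggs": {
            "one_doc": {  # hypothetical name, not from the thread
                "filter": {"ids": {"values": [doc_id]}},
                "aggs": {
                    "items": {
                        "nested": {"path": path},
                        "aggs": {
                            "questions": {
                                # "size": 0 is kept from the original request
                                "terms": {"field": field, "size": 0}
                            }
                        },
                    }
                },
            }
        }
    }


def agg_types(node):
    """Return the aggregation type keys of a node, ignoring sub-aggregations."""
    return [key for key in node if key not in ("aggs", "aggregations")]


body = nested_terms_for_one_doc("2AA4CE67-9469-4AE7-AC99-46F7E2646C2F")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the search request body in place of the first request above.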



--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4bf1cf1d-8f4b-41f1-add1-efa952691b64%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Filter aggregation and nested documents

Binh Ly-2
You are correct. Unfortunately the fielddata is loaded for all docs regardless of filter condition. You can:

1) Add more RAM

2) Add more nodes (and shard your index out so that RAM usage will be distributed across multiple nodes)

3) Use disk-based fielddata (fielddata will not be loaded into memory) for the field(s) you are aggregating on. This will run slower and you have to reindex your data.

http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/
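
For option 3, a rough sketch of what the mapping might look like, assuming the 1.x mapping syntax in which doc values are selected with "fielddata": {"format": "doc_values"} on a not_analyzed string field. The poster's real mapping is not shown in this thread, so the type name "survey" and the layout of "items.question_label" with a "raw" sub-field are assumptions inferred from the field name in the query. The body is built in Python only so its structure can be checked.

```python
import json


def doc_values_mapping(type_name="survey"):
    """Build index-creation settings that store the raw label as doc values.

    `type_name` is a hypothetical mapping type; adjust to the real index.
    """
    raw_subfield = {
        "type": "string",
        "index": "not_analyzed",  # doc values need a non-analyzed string
        "fielddata": {"format": "doc_values"},  # assumed 1.x syntax
    }
    return {
        "mappings": {
            type_name: {
                "properties": {
                    "items": {
                        "type": "nested",
                        "properties": {
                            "question_label": {
                                "type": "string",
                                "fields": {"raw": raw_subfield},
                            }
                        },
                    }
                }
            }
        }
    }


print(json.dumps(doc_values_mapping(), indent=2))
```

Because the fielddata format is set in the mapping, this applies to a newly created index, which is why the existing data has to be reindexed.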

Re: Filter aggregation and nested documents

Olivier B
Thank you.
OK, that's what I feared: the cache is loaded regardless of the filter condition. That's a shame: even if we filter heavily, targeting only one document, we still need to fill up the cache!
I will try adding a lot of RAM to see whether memory usage stabilizes, and then simply leave the cache as it is.
An alternative solution is to have many indexes, so that each index acts as a pre-filter and contains far less data.
Do you know whether the fielddata cache loads all docs, or only those in the relevant shard? Would it help to have smaller shards?

(In reply to Binh Ly's message of Monday, April 28, 2014 11:55:22 PM UTC+10.)

Re: Filter aggregation and nested documents

x0ne-2
When fielddata is loaded, is it only the field that the aggregation needs (items.question_label.raw in this case), or does it load the full _source of every match and extract the field?

(In reply to Olivier B's message of Monday, April 28, 2014 9:04:09 PM UTC-4.)