slow filter execution

slow filter execution

Kireet Reddy
One of my queries has been consistently taking 500ms-1s and I can't figure out why. Here is the query (it looks a bit strange because I have removed things that didn't seem to affect execution time). When I remove the range filter, the query consistently takes < 10ms. The query returns only 1 hit with or without the range filter, so I am not sure why simply including this filter adds so much time. My nodes are not experiencing any filter cache evictions. I also tried moving it to the bool section, with no luck. Changing execution to "fielddata" does bring execution time down to < 10ms, though. Since I am sorting on the same field, I suppose this should be fine, but I would like to understand why the slowdown occurs. The published field is a date type and has eager field data loading enabled.
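For illustration only, a range filter with execution set to "fielddata" might be built as in the sketch below. The helper name is hypothetical, the timestamp is the one from the query posted later in this thread, and the placement of "execution" next to the field name is an assumption based on the 1.x range filter syntax.

```python
import json

def range_filter_fielddata(field, upper_bound_ms):
    """Build a 1.x-style range filter that is evaluated against field data."""
    return {
        "range": {
            field: {"to": upper_bound_ms},
            # Assumed 1.x option: use loaded field data rather than the index.
            "execution": "fielddata",
        }
    }

body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": range_filter_fielddata("published", 1406064191883),
        }
    }
}

# Print the request body that would be sent to the _search endpoint.
print(json.dumps(body, indent=2))
```

Since field data for the field is already loaded eagerly here, this variant should not add extra memory on top of what is in use.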

Thanks
Kireet


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/994f4700-7a52-4db4-a2a7-d252732517bd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: slow filter execution

dadoonet
Any chance your filter value changes for every call?
Or are you using exactly the same value each time?

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



Re: slow filter execution

Kireet Reddy

For my test case it's the same every time. In the "real" query it will change every time, but I planned not to cache this filter and to have a less granular date filter in the bool filter that would be cached. However, while debugging I noticed slowness with the date range filters even when testing with the same value repeatedly.
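A rough sketch of that plan, assuming hour granularity for the coarse filter and a numeric "to" bound in milliseconds (both are assumptions for illustration, and the helper names are hypothetical). The coarse bound is rounded up so that it always covers the exact range and stays identical across calls within the same hour.

```python
HOUR_MS = 60 * 60 * 1000

def coarse_upper_bound(ts_ms, granularity_ms=HOUR_MS):
    """Round a millisecond timestamp up to the next granularity boundary."""
    return -(-ts_ms // granularity_ms) * granularity_ms  # ceiling division

def two_level_filters(field, ts_ms):
    """Return a coarse, cacheable range filter plus an exact, uncached one."""
    return [
        # Same value for a whole hour, so a cached result can be reused.
        {"range": {field: {"to": coarse_upper_bound(ts_ms)}, "_cache": True}},
        # Changes on every call, so caching it would only churn the cache.
        {"range": {field: {"to": ts_ms}, "_cache": False}},
    ]

filters = two_level_filters("published", 1406064191883)
print(filters[0]["range"]["published"]["to"])  # 1406066400000
```

Both filters would go into the "must" list of the bool filter.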


Re: slow filter execution

dadoonet
Maybe a stupid question: why did you put that filter inside a query and not within the same filter you have at the end?


Re: slow filter execution

Clinton Gormley-2
Don't use the `and` filter - use the `bool` filter instead.  They have different execution modes and the `bool` filter works best with bitset filters (but also knows how to handle non-bitset filters like geo etc).  

Just remove the `and`, `or` and `not` filters from your DSL vocabulary.
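As a hedged sketch of what that rewrite looks like mechanically, the hypothetical helper below turns `and`, `or` and `not` filters into the equivalent `bool` clauses (`must`, `should`, `must_not`). It assumes the plain 1.x forms (a list of filters, or an object with a `filters` / `filter` key) and drops any `_cache` setting on the wrapper.

```python
def to_bool_filter(f):
    """Recursively rewrite and/or/not filters as an equivalent bool filter."""
    if not isinstance(f, dict) or len(f) != 1:
        return f  # not a single-key filter object; leave untouched
    (name, body), = f.items()
    if name in ("and", "or"):
        # Accept both the bare list form and the {"filters": [...]} form.
        clauses = body["filters"] if isinstance(body, dict) else body
        key = "must" if name == "and" else "should"
        return {"bool": {key: [to_bool_filter(c) for c in clauses]}}
    if name == "not":
        # Accept both {"not": {...}} and {"not": {"filter": {...}}}.
        inner = body.get("filter", body)
        return {"bool": {"must_not": [to_bool_filter(inner)]}}
    return f

old = {"and": [
    {"terms": {"source_id": ["s1", "s2", "s3"]}},
    {"not": {"term": {"lang": "xx"}}},  # hypothetical clause for illustration
]}
print(to_bool_filter(old))
```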

Also, not sure why you are ANDing with a match_all filter - that doesn't make much sense.

Depending on which version of ES you're using, you may be encountering a bug in the filtered query which ended up always running the query first, instead of the filter. This was fixed in v1.2.0 (https://github.com/elasticsearch/elasticsearch/issues/6247). If you are on an earlier version you can force filter-first execution manually by specifying a "strategy" of "random_access_100". See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html#_filter_strategy

In summary (and taking your less granular datetime clause into account), your query would be better written as:

    GET /_search
    {
      "query": {
        "filtered": {
          "strategy": "random_access_100",  #### pre 1.2 only
          "filter": {
            "bool": {
              "must": [
                {
                  "terms": {
                    "source_id": [ "s1", "s2", "s3" ]
                  }
                },
                {
                  "range": {
                    "published": {
                      "gte": "now-1d/d"  #### coarse grained, cached
                    }
                  }
                },
                {
                  "range": {
                    "published": {
                      "gte": "now-30m" #### fine grained, not cached, could use fielddata too
                    },
                    "_cache": false
                  }
                }
              ]
            }
          }
        }
      }
    }






Re: slow filter execution

Kireet Reddy
Thanks for the detailed reply. 

I am a bit confused about "and" vs. "bool" filter execution. I read this post on the elasticsearch blog: http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/. From that, I thought the bool filter would work by basically creating a bitset for the entire segment(s) being examined. If the filter value changes every time, will this still be cheaper than an "and" filter that will just examine the matching docs? My segments can be very big, and this query, for example, only matched one document.
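The trade-off being asked about can be pictured with a toy model. This is not Elasticsearch's actual implementation, only a sketch of the two execution styles as described above, with made-up predicates and a made-up segment size: one style builds a full bitset per clause and intersects them, the other checks each later clause only on documents that survived the earlier ones.

```python
def bitset_style(doc_count, predicates):
    """bool-like: one bitset per clause over the whole segment, then intersect."""
    evaluations = 0
    result = (1 << doc_count) - 1  # start with every document set
    for pred in predicates:
        bits = 0
        for doc in range(doc_count):
            evaluations += 1
            if pred(doc):
                bits |= 1 << doc
        result &= bits
    matches = [d for d in range(doc_count) if (result >> d) & 1]
    return matches, evaluations

def doc_by_doc_style(doc_count, predicates):
    """and-like: later clauses only look at docs that passed earlier ones."""
    evaluations = 0
    candidates = list(range(doc_count))
    for pred in predicates:
        survivors = []
        for doc in candidates:
            evaluations += 1
            if pred(doc):
                survivors.append(doc)
        candidates = survivors
    return candidates, evaluations

preds = [lambda d: d == 42, lambda d: d < 500]  # selective clause first
print(bitset_style(1000, preds))      # ([42], 2000)
print(doc_by_doc_style(1000, preds))  # ([42], 1001)
```

In this model the bitset style only pays off when the per-clause bitsets can be cached and reused across requests, which is exactly what a filter value that changes on every call prevents.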

There is no match_all query filter; there is a "match" query filter on a field named "all". :)

Based on your feedback, I moved all filters, including the query filter, into the bool filter. However, it didn't change things: the query is an order of magnitude slower with the range filter, unless I set execution to fielddata. I am using 1.2.2; I tried the strategy anyway and it didn't make a difference.

{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "bool": {
                    "must": [
                        {
                            "terms": {
                                "source_id": ["s1", "s2", "s3"]
                            }
                        },
                        {
                            "query": {
                                "match": {
                                    "all": {
                                        "query": "foo"
                                    }
                                }
                            }
                        },
                        {
                            "range": {
                                "published": {
                                    "to": 1406064191883
                                }
                            }
                        }
                    ]
                }
            }
        }
    },
    "sort": [
        {
            "crawlDate": {
                "order": "desc"
            }
        }
    ]
}


Re: slow filter execution

Kireet Reddy
Quick update: I found that if I explicitly set _cache to true, things seem to work more as expected, i.e. subsequent executions of the query speed up. I looked at DateFieldMapper.rangeFilter(), and to me it looks like caching is disabled when a number is passed, unless it's explicitly set to true. This matches my observed behavior. Not sure whether this has been fixed in 1.3.x yet.
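A small sketch of that workaround, using a hypothetical helper that adds "_cache": true to every range filter in a list whose bounds are plain numbers (date-math strings such as "now-30m" are left alone). The filter list mirrors the query posted earlier in this thread.

```python
import copy

def force_cache_on_numeric_ranges(filters):
    """Return a copy of the filters with _cache set on numeric range filters."""
    out = copy.deepcopy(filters)  # leave the caller's filters unchanged
    for f in out:
        rng = f.get("range")
        if rng is None:
            continue
        # Collect the bound values (gte, to, ...) of the ranged field(s).
        bounds = [v for spec in rng.values() if isinstance(spec, dict)
                  for v in spec.values()]
        if any(isinstance(b, (int, float)) and not isinstance(b, bool)
               for b in bounds):
            rng["_cache"] = True
    return out

must = [
    {"terms": {"source_id": ["s1", "s2", "s3"]}},
    {"range": {"published": {"to": 1406064191883}}},
]
print(force_cache_on_numeric_ranges(must)[1])
```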


Re: slow filter execution

Clinton Gormley-2


Nice catch!!!


