Advice needed for searching: filters vs. queries

InquiringMind
I currently implement all my production application client queries directly in Java, and use a BoolQueryBuilder to wrap all of my indexed field queries. I currently only use a filter for geo distance queries. The toString method creates a very nice pretty-printed JSON form of the search that the Java API can accept for testing and demonstration purposes.

http://jontai.me/blog/2012/10/using-elasticsearch-to-speed-up-filtering/ is an interesting article. I'm not using MongoDB; when using ElasticSearch it is my one and only DB. So keeping _source enabled is necessary. And I've already disabled the _all field and seen the greatly improved build results he sees.

But the migration from queries to filters is what caught my attention. I had already been looking at this, and have some questions:

Instead of using the static QueryBuilders.boolQuery method to create a BoolQueryBuilder, I was considering using a FilterBuilders.boolFilter method to create a BoolFilterBuilder. It seems to have the andFilter, notFilter, and orFilter counterparts to the BoolQueryBuilder's must, mustNot, and should methods. Is the only difference between queries and filters really just scoring?

Do I really need to create a QueryBuilders.matchAll query builder and then add filters to it?

Of course, there doesn't seem to be a counterpart for phrase matching in the filter query world. So when I detect a blank inside a term string, I create a phrase match query as follows:

MatchQueryBuilder mqb = matchPhraseQuery(field, qterm.getValue());
mqb.slop(qterm.getSlop());
return mqb;

But by default, I use the fieldQueryBuilder, since it automatically recognizes strings such as A+B as a phrase, and it also recognizes certain Chinese characters as individual words of a phrase. Very nice, and fully compatible with values of one term or a phrase.

FieldQueryBuilder fqb = fieldQuery(field, qterm.getValue());
fqb.defaultOperator(FieldQueryBuilder.Operator.AND);
fqb.autoGeneratePhraseQueries(true);
fqb.enablePositionIncrements(true);
fqb.phraseSlop(qterm.getSlop());
return fqb;      

Is there some requirement or benefit to constructing a search using a top-level QueryBuilders.matchAll and adding the complex tree of filter builders to it? Or can I bypass the query builders? Or is phrase matching something that makes it impossible to generically throw either a single term or a phrase into the query (as I can easily do with query builders)?

The caching isn't all that interesting: Ad-hoc queries that are complex vary widely, and are rarely the same from call to call and across clients. So once the search engine is warmed up, the non-cached steady state response times are the most interesting. (For example, a cached query can return in a few milliseconds, but the first instance of that query took 35 seconds and it only returned 2 matches across those 78M documents.)

Or do I really need to wait until I throw enough machines at this to wring out the best performance from ad-hoc complex searches?

In the meantime, my application's most commonly used query is get-by-ID (index.type.id) and that performs brilliantly fast when not cached, even for databases that approach 100M documents. So I have some time to research and experiment.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Advice needed for searching: filters vs. queries

Clinton Gormley-2
Hiya

> But the migration from queries to filters is what caught my attention.
> I had already been looking at this, and have some questions:
>
> Instead of using the static QueryBuilders.boolQuery method to create a
> BoolQueryBuilder, I was considering using a FilterBuilders.boolFilter
> method to create a BoolFilterBuilder. It seems to have the
> andFilter, notFilter, and orFilter counterparts to the
> BoolQueryBuilder's must, mustNot, and should methods. Is the only
> difference between queries and filters really just scoring?

Filters are faster, because they are simpler. They don't have to do any
scoring. On top of that, most filters can be cached in a compact bitset,
making them even faster when you reuse them.

Filters don't do full text analysis, and don't do scoring (although they
can be combined with the custom_filters_score query to influence
scoring).

So yes.  Use filters wherever you can.

Bool filter vs and/or/not:

The bool filter consumes bitsets.  Most filters produce bitsets, eg a
filter like { term: { status: "active" }} will examine every document in
the index and create a bitset for the entire index (one bit per
document) which contains '1' if the document matches, and '0' if it
doesn't.

Combining these bitsets is very efficient.

However, certain filters (geo filters and numeric_range) don't create
bitsets. They examine each doc in turn.  Running a geo-distance filter
on every doc in the index is heavy.  You want to avoid that.

and/or/not filters don't demand bitsets. They work doc-by-doc, so
they're a good fit for geo filters. They also short-circuit.  If a doc
has already been excluded by an earlier filter, it won't run the later
filters.

So to put it all together, combine the bitset filters with a bool
filter, and then combine the bool filter with the geo filter using an
'and' clause, with the geo filter placed after the bool filter (see the
example below).


>
>
> Do I really need to create a QueryBuilders.matchAll query builder and
> then add filters to it?

You can use a filtered query, or a constant score query:

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1'  -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               {
                  "bool" : {
                     "must" : [
                        {
                           "term" : {
                              "status" : "active"
                           }
                        },
                        {
                           "range" : {
                              "date" : {
                                 "gte" : "2013-01-01"
                              }
                           }
                        }
                     ]
                  }
               },
               {
                  "geo_distance" : {
                     "distance" : "10km",
                     "location" : [
                        0,
                        0
                     ]
                  }
               }
            ]
         }
      }
   }
}
'
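
The ordering in the request above (cheap bitset filters combined first, the geo check last) can be sketched roughly in plain Java. This is only a toy model under stated assumptions: java.util.BitSet stands in for the cached filter bitsets, the withinDistance predicate is a hypothetical stand-in for a geo_distance test, and the class and method names are made up for illustration. It is not how Elasticsearch or Lucene implement these filters.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

/** Toy model of "bitset filters first, expensive per-doc filter last". Not Elasticsearch code. */
class BitsetThenGeoSketch {

    /** Counts how many times the expensive per-document check ran. */
    static int expensiveChecks = 0;

    /** Models a "bool" filter with only must clauses: intersect precomputed bitsets. */
    static BitSet boolMust(int maxDoc, BitSet... clauses) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);      // start with every document
        for (BitSet clause : clauses) {
            result.and(clause);     // cheap bitwise intersection
        }
        return result;
    }

    /** Models the "and" step: run the per-doc check only on docs that survived the bitsets. */
    static BitSet andThen(BitSet survivors, IntPredicate perDocCheck) {
        BitSet result = new BitSet(survivors.length());
        for (int doc = survivors.nextSetBit(0); doc >= 0; doc = survivors.nextSetBit(doc + 1)) {
            expensiveChecks++;
            if (perDocCheck.test(doc)) {
                result.set(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 8;
        BitSet statusActive = new BitSet(maxDoc);
        BitSet dateInRange = new BitSet(maxDoc);
        for (int doc : new int[] {0, 1, 2, 3, 5, 7}) statusActive.set(doc);
        for (int doc : new int[] {1, 3, 4, 5, 6}) dateInRange.set(doc);

        // Hypothetical stand-in for a geo_distance test: docs 3..5 are "within range".
        IntPredicate withinDistance = doc -> doc >= 3 && doc <= 5;

        BitSet cheap = boolMust(maxDoc, statusActive, dateInRange);
        BitSet hits = andThen(cheap, withinDistance);

        System.out.println("after bool: " + cheap);
        System.out.println("hits: " + hits);
        System.out.println("expensive checks: " + expensiveChecks + " of " + maxDoc + " docs");
    }
}
```

In this toy index the per-document check runs only for the three documents that pass both bitsets, rather than for all eight, which is the point of putting the geo filter last.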

>
>
> Of course, there doesn't seem to be a counterpart for phrase matching
> in the filter query world. So when I detect a blank inside a term
> string, I create a phrase match query as follows:

Correct - any part of the query that relates to "full text search" is
better off handled by queries.


>
> But by default, I use the fieldQueryBuilder, since it automatically
> recognizes strings such as A+B as a phrase, and it also recognizes
> certain Chinese characters as individual words of a phrase. Very nice,
> and fully compatible with values of one term or a phrase.

The field/query_string query can be useful and powerful, but is
problematic. First, formatting the query correctly can be tricky - quite
often it is not obvious that it is not running the query that you expect
(it's a complicated syntax). Second, any syntax error will just cause
the query to fail - no results.  Third, you're exposing your search to
abuse by very heavy queries, e.g. "a* b* c* d* e* ...".

I think that search keywords should be parsed by your application to
allow just the queries that you specifically want to allow.
>

>
> Is there some requirement or benefit to constructing a search using a
> top-level QueryBuilders.matchAll and adding the complex tree of filter
> builders to it? Or can I bypass the query builders? Or is phrase
> matching something that makes it impossible to generically throw
> either a single term or a phrase into the query (as I can easily do
> with query builders).

full text search -> use queries

use filters for the kind of thing you would normally express with SQL:

   WHERE id IN (1,2,3)
     AND ( date >= '2013-01-01'  OR featured )
     AND status = 'active'

that kind of thing.

>
>
> The caching isn't all that interesting: Ad-hoc queries that are
> complex vary widely, and are rarely the same from call to call and
> across clients. So once the search engine is warmed up, the non-cached
> steady state response times are the most interesting. (For example, a
> cached query can return in a few milliseconds, but the first instance
> of that query took 35 seconds and it only returned 2 matches across
> those 78M documents.)

Queries aren't cached, but a query is faster once the data required to
run the query is loaded into the kernel filesystem caches (which I think
is what you mean).  Having lots of kernel cache space is good for
performance.  

clint



Re: Advice needed for searching: filters vs. queries

InquiringMind
Thank you so much for the carefully written and detailed explanation. It will give me a lot of things to think about. And it would be an excellent first draft for a tutorial on this subject!

Taking your suggestion to avoid the match_all query, I made a small change to my test client. For this test, I was looking for all USA cities within 50km of San Jose, IL (there are two of them, including San Jose, IL itself). The only geocoded data I have right now is the list of US and Puerto Rico cities from the US Census. So I never noticed that my plain distance query (find all geocoded things within 50km of a specified center point) had a performance trap in it.

Here are the original and updated forms of that distance query. They're actually implemented in the Java API, but my test client can emit the pretty-printed JSON form that can then be pasted directly into a curl command to make everyone on this newsgroup happy!

Original query:

curl -XGET 'http://localhost:9200/census/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "filtered" : {
      "query" : {
        "bool" : {
          "must" : {
            "match_all" : { }
          }
        }
      },
      "filter" : {
        "geo_distance" : {
          "location" : [ -89.604788, 40.303962 ],
          "distance" : "10.0km",
          "distance_type" : "arc"
        }
      }
    }
  },
  "version" : true,
  "explain" : false,
  "sort" : [ {
    "_geo_distance" : {
      "location" : [ -89.604788, 40.303962 ],
      "distance_type" : "arc"
    }
  } ]
}'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : null,
    "hits" : [ {
      "_index" : "census",
      "_type" : "locality",
      "_id" : "hr0GXt2ySYa_LUD2XdIPXQ",
      "_version" : 1,
      "_score" : null, "_source" : { "city" : "San Jose", "state" : "IL", "location" : [ -89.604788, 40.303962 ] },
      "sort" : [ 0.0 ]
    }, {
      "_index" : "census",
      "_type" : "locality",
      "_id" : "8OAy1qLVSLab8g48_m3QCg",
      "_version" : 1,
      "_score" : null, "_source" : { "city" : "Delavan", "state" : "IL", "location" : [ -89.545651, 40.370835 ] },
      "sort" : [ 8.967529897116258 ]
    } ]
  }
}

Updated query based on your suggestion:

curl -XGET 'http://localhost:9200/census/_search?pretty=true' -d'
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "constant_score" : {
      "filter" : {
        "geo_distance" : {
          "location" : [ -89.604788, 40.303962 ],
          "distance" : "10.0km",
          "distance_type" : "arc"
        }
      }
    }
  },
  "version" : true,
  "explain" : false,
  "sort" : [ {
    "_geo_distance" : {
      "location" : [ -89.604788, 40.303962 ],
      "distance_type" : "arc"
    }
  } ]
}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : null,
    "hits" : [ {
      "_index" : "census",
      "_type" : "locality",
      "_id" : "hr0GXt2ySYa_LUD2XdIPXQ",
      "_version" : 1,
      "_score" : null, "_source" : { "city" : "San Jose", "state" : "IL", "location" : [ -89.604788, 40.303962 ] },
      "sort" : [ 0.0 ]
    }, {
      "_index" : "census",
      "_type" : "locality",
      "_id" : "8OAy1qLVSLab8g48_m3QCg",
      "_version" : 1,
      "_score" : null, "_source" : { "city" : "Delavan", "state" : "IL", "location" : [ -89.545651, 40.370835 ] },
      "sort" : [ 8.967529897116258 ]
    } ]
  }
}

They both give the same results, and after a couple of runs each, they both return in about 1 ms. But the second query was faster the first time. (Small data set makes it difficult to truly measure the performance of this particular type of query).

It'll take some time to digest the rest. Since I first wrote the client, I've finally (with some help from this newsgroup!) mastered the index settings and mappings, and have even created a tool that converts a very simple high-level schema into the settings and mappings with all the trimmings (character filters, token filters, custom analyzers, with all the right languages, as needed). This makes it very nice to change mappings on a whim during experimentation and testing. So now that I have your guidance, I can take the next steps much more easily.


Re: Advice needed for searching: filters vs. queries

Clinton Gormley-2

>
> They both give the same results, and after a couple of runs each, they
> both return in about 1 ms. But the second query was faster the first
> time. (Small data set makes it difficult to truly measure the
> performance of this particular type of query).

Actually, the difference between a filtered query with match_all and a
filter, and just a constant_score query with a filter should be minimal.
Any differences you saw were probably due to file system caching.

clint




Re: Advice needed for searching: filters vs. queries

shlomivaknin
Hey,

You are right, I don't see a difference between match_all + filter and constant_score + filter, but searching this way always comes back slower than a term query.

Here is my example:
"query": {
    "query_string": {
      "query": "+test +another"
    }
  }

comes back after 2ms (average after many attempts):
{:took 2, :timed_out false, :_shards {:total 5, :successful 5, :failed 0}, :hits {:total 1, :max_score 18.649187, :hits [{:_index test-index, :_type test, :_id OG0xbcF-TEuNWCSGlSENhw, :_score 18.649187, :_source {...}}]}}


while both:
{:constant_score {:filter {:bool {:must [{:term {:gram "test"}} {:term {:gram "another"}}]}}}}
and
{query: match_all} {:bool {:must [{:term {:gram "the_thing_is"}} {:term {:gram "what_if_the"}}]}}

comes back after 23ms (average after many attempts):
{:took 23, :timed_out false, :_shards {:total 5, :successful 5, :failed 0}, :hits {:total 1, :max_score 1.0, :hits [{:_index test-index, :_type test, :_id OG0xbcF-TEuNWCSGlSENhw, :_score 1.0, :_source {...}}]}}

how does that make sense?






Re: Advice needed for searching: filters vs. queries

Clinton Gormley-2

>
> while both:
>
> {:constant_score {:filter {:bool {:must [{:term {:gram "test"}} {:term
> {:gram "another"}}]}}}}
>
> and
> {query: match_all} {:bool {:must [{:term {:gram "the_thing_is"}}
> {:term {:gram "what_if_the"}}]}}

Can you post the full search command that you use, in curl style - from
the above I don't know what phase you are using to run these particular
clauses

clint




Re: Advice needed for searching: filters vs. queries

shlomivaknin
sure, here are both queries:
curl -XGET http://es-test:9200/test-index/test-type/_search/  -d '{"filter":{"bool":{"must":[{"term":{"gram":"test"}},{"term":{"gram":"another"}}]}},"query":{"match_all":{}}}'
curl -XGET http://es-test:9200/test-index/test-type/_search/  -d '{"query": {"constant_score": {"filter": {"bool": {"must": [{"term": {"gram": "test"}},{"term": {"gram": "another"}}]}}}}}'

They both take more than 20 ms,

while the following:
curl -XGET http://es-test:9200/test-index/test-type/_search/  -d '{"query":{"field":{"gram":"+test +another"}}}'

returns in 2ms







Re: Advice needed for searching: filters vs. queries

Clinton Gormley-2
On Mon, 2013-03-18 at 07:06 -0700, Shlomi wrote:
> sure, here are both queries:
> curl -XGET http://es-test:9200/test-index/test-type/_search/  -d
> '{"filter":{"bool":{"must":[{"term":{"gram":"test"}},{"term":{"gram":"another"}}]}},"query":{"match_all":{}}}'

Using the top-level filter param means:
1) return all documents in the query (match all in this case)
2) calculate facets, if any
3) then filter results
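
A small Java sketch of that ordering, for illustration only. The Doc type, the "category" field used for the facet, and all method names are hypothetical; the model just shows why a top-level filter narrows the hits but not the facet counts, whereas a filter placed inside the query narrows both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Toy model of query -> facets -> top-level filter. Not Elasticsearch code. */
class TopLevelFilterSketch {

    /** Hypothetical document with one searchable field and one facet field. */
    static final class Doc {
        final String gram;
        final String category;
        Doc(String gram, String category) { this.gram = gram; this.category = category; }
    }

    /** Counts documents per category, like a very simple terms facet. */
    static Map<String, Integer> facet(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc d : docs) counts.merge(d.category, 1, Integer::sum);
        return counts;
    }

    /** Keeps only the documents accepted by the filter. */
    static List<Doc> keep(List<Doc> docs, Predicate<Doc> filter) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : docs) if (filter.test(d)) out.add(d);
        return out;
    }

    public static void main(String[] args) {
        List<Doc> index = Arrays.asList(
            new Doc("test", "a"), new Doc("test", "b"), new Doc("other", "a"));
        Predicate<Doc> filter = d -> d.gram.equals("test");

        // Top-level filter: 1) run the query (match_all), 2) facets, 3) filter the hits.
        List<Doc> queryResults = index;
        Map<String, Integer> facetsTopLevel = facet(queryResults);
        List<Doc> hitsTopLevel = keep(queryResults, filter);

        // Filter inside the query: facets only see documents that passed the filter.
        List<Doc> filteredResults = keep(index, filter);
        Map<String, Integer> facetsFiltered = facet(filteredResults);

        System.out.println("top-level filter: hits=" + hitsTopLevel.size() + " facets=" + facetsTopLevel);
        System.out.println("filter in query:  hits=" + filteredResults.size() + " facets=" + facetsFiltered);
    }
}
```

Both variants return the same two hits here; only the facet counts differ, and the top-level variant has to visit every document matched by the query before it filters.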


> curl -XGET http://es-test:9200/test-index/test-type/_search/  -d
> '{"query": {"constant_score": {"filter": {"bool": {"must": [{"term":
> {"gram": "test"}},{"term": {"gram": "another"}}]}}}}}'

This looks good - I'm very surprised it is taking 20ms, as I'd expect
this to return in the 2-3 ms range

>
>
> they both get me >20 ms,
>
>
> while he following:
> curl -XGET http://es-test:9200/test-index/test-type/_search/  -d
> '{"query":{"field":{"gram":"+test +another"}}}'

Queries are fast and efficient, but the performance also depends on how
many docs match.  In this case, you only have one matching doc, so they
get to show off their speed.  But if you have 10 million docs which
match, all of which need to be scored, then it'd be best to apply
filters to them before scoring.

clint




Re: Advice needed for searching: filters vs. queries

shlomivaknin


> Queries are fast and efficient, but the performance also depends on how
> many docs match.  In this case, you only have one matching doc, so they
> get to show off their speed.  But if you have 10 million docs which
> match, all of which need to be scored, then it'd be best to apply
> filters to them before scoring.

Oh I see, that makes sense now, thanks!


Re: Advice needed for searching: filters vs. queries

phill
In reply to this post by Clinton Gormley-2
On 3/18/2013 8:02 AM, Clinton Gormley wrote:
> But if you have 10 million docs which match, all of which need to be
> scored, then it'd be best to apply filters to them before scoring.

I'm confused on how to apply filters BEFORE queries.  I always wanted to
apply filters BEFORE queries, but I don't see how to do that in the same
way that filters after queries work.

The places to put a filter are:
1. filter in the search request - as Clinton said, that really is the
last thing applied.
2. filtered query "A query that applies a filter to the results of
another query"

That sounds like post query filtering to me.
Therefore, I do NOT see where I can do filtering before scoring.

What is a normal pattern for creating queries for doing filtering BEFORE
scoring?
Is there something better than what I have below?
Wouldn't the following combine (i.e. coordination as it is called) the
score for the filtering in the 1st "must" (even if boosted or const)
with other "should"s and "must"s?
Is there any way around a pre-filtering that doesn't affect the score in
some way?
Am I overly worried about this tweaking of the score by a pre-filter
sub-expression?

Why would I want to preserve the scoring and not have it affected by the
filtering?  When a user writes a search to boost a term or phrase etc.,
it seems messy to have this other pre-filter expression
going into scoring, particularly when I am also embellishing the user's
query with "helpful" phrases and spans of my own based on the user's input.

My best attempt at pre-filtering this morning:
{
     "bool": {
         "must": [
             {
                 "filtered": {
                     "query": {
                         "match_all": {}
                     },
                     "filter": {
                         ... insert all of your pre-filtering here.
                     }
                 }
             },
             ... insert all of your other "must"s here
         ],
         "should": [
             ... insert all of your "should"s here
         ]
     }
}

-Paul


Re: Advice needed for searching: filters vs. queries

Clinton Gormley-2
Hi Paul

On Mon, 2013-03-18 at 11:47 -0700, P. Hill wrote:

> On 3/18/2013 8:02 AM, Clinton Gormley wrote:
> > But if you have 10 million docs which match, all of which need to be
> > scored, then it'd be best to apply filters to them before scoring.
>
> I'm confused on how to apply filters BEFORE queries.  I always wanted to
> apply filters BEFORE queries, but I don't see how to do that in the same
> way that filters after queries work.
>
> The places to put a filter are:
> 1. filter in the search request - as Clinton said, that really is the
> last thing applied.
> 2. filtered query "A query that applies a filter to the results of
> another query"

Don't believe the docs ;)

Pre 0.90 I believe the filter and query were executed in tandem, with
both the filter and the query advancing one doc at a time (the
"leapfrog" approach).

From 0.90 onwards, you can specify a "strategy" in the filtered query,
which can be set to:
    query_first
    random_access_always
    leap_frog
    random_access_THRESHOLD
    leap_frog_query_first
    leap_frog_filter_first

The default is to use random access where possible, and to fall back to
leap_frog_filter_first where not.

The random_access_THRESHOLD allows you to specify an integer THRESHOLD.
Not entirely sure what happens in this case, but hopefully will be
documented forthwith
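
For readers unfamiliar with the terms, here is a rough Java sketch of the two basic ideas behind those strategy names: "leapfrog" (query and filter each advance through sorted doc ids, jumping to the other side's candidate) and "random access" (the filter is a bitset that can be probed for any doc id). It is a simplified model under stated assumptions: the class and method names are hypothetical, the advance step is a linear scan where a real index would use skip data, and it does not reproduce Lucene's actual FilteredQuery code or the THRESHOLD logic.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of two ways to combine a query with a filter. Not Lucene's actual code. */
class FilterStrategySketch {

    /** Leapfrog: both sides are sorted doc-id arrays; each jumps ahead to the other's candidate. */
    static List<Integer> leapfrog(int[] queryDocs, int[] filterDocs) {
        List<Integer> hits = new ArrayList<>();
        int q = 0;
        int f = 0;
        while (q < queryDocs.length && f < filterDocs.length) {
            if (queryDocs[q] == filterDocs[f]) {
                hits.add(queryDocs[q]);     // both sides agree: this doc matches
                q++;
                f++;
            } else if (queryDocs[q] < filterDocs[f]) {
                q = advance(queryDocs, q, filterDocs[f]);
            } else {
                f = advance(filterDocs, f, queryDocs[q]);
            }
        }
        return hits;
    }

    /** Returns the first position at or after 'from' whose doc id is >= target (linear scan). */
    static int advance(int[] docs, int from, int target) {
        int pos = from;
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos;
    }

    /** Random access: the filter is a bitset, so each query doc needs only one bit lookup. */
    static List<Integer> randomAccess(int[] queryDocs, BitSet filterBits) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : queryDocs) {
            if (filterBits.get(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] queryDocs = {2, 4, 7, 9, 15};
        int[] filterDocs = {1, 4, 5, 9, 20};
        BitSet filterBits = new BitSet();
        for (int doc : filterDocs) filterBits.set(doc);

        System.out.println("leapfrog:      " + leapfrog(queryDocs, filterDocs));
        System.out.println("random access: " + randomAccess(queryDocs, filterBits));
    }
}
```

Both methods return the same matches; they differ in how the work is done. In either case only documents that pass the filter would go on to be scored.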

clint



Re: Advice needed for searching: filters vs. queries

InquiringMind
It would seem that for each indexed term, the matching documents (hopefully, the _id references to them) should be sub-ordered by _id.

Then a doc-at-a-time approach would fetch and process a single document from each clause. It would honor a minimum _id value and skip over any _id values less than this minimum. If there are no matches from a must clause, the minimum _id could then be advanced for all of the clauses. This would effectively skip huge swaths of documents that would otherwise match each clause.

I don't know if Lucene could be readily taught to do this. But I know it works from another "NoSQL" engine I created once. But it wasn't Lucene, and Lucene is where the crowd hangs out!




Re: Advice needed for searching: filters vs. queries

phill
In reply to this post by Clinton Gormley-2
On 3/18/2013 12:38 PM, Clinton Gormley wrote:

> Hi Paul
>
> On Mon, 2013-03-18 at 11:47 -0700, P. Hill wrote:
>> On 3/18/2013 8:02 AM, Clinton Gormley wrote:
>>> But if you have 10 million docs which match, all of which need to be
>>> scored, then it'd be best to apply filters to them before scoring.
>> I'm confused on how to apply filters BEFORE queries.  I always wanted to
>> apply filters BEFORE queries, but I don't see how to do that in the same
>> way that filters after queries work.
>>
>> The places to put a filter are:
>> 1. filter in the search request - as Clinton said, that really is the
>> last thing applied.
>> 2. filtered query "A query that applies a filter to the results of
>> another query"
> Don't believe the docs ;)
>
> Pre 0.90 I believe the filter and query were executed in tandem, with
> both the filter and the query advancing one doc at a time (the
> "leapfrog" approach).

Oh that NEEDS to be documented SO BAD!
random_access?  You mention it is used, but didn't suggest what it is.  
I don't have a guess.
What is wrong with (entire) filter_first, particularly combined with a
cached filter?

-Paul


Re: Advice needed for searching: filters vrs. queries

Clinton Gormley-2

> Oh that NEEDS to be documented SO BAD!
> random_access?  You mention it is used, but didn't suggest what it is.  
> I don't have a guess.
> What is wrong with (entire) filter_first, particularly combined with a
> cached filter?

This may shed some more light on it:

http://blog.mikemccandless.com/2010/09/fast-search-filters-using-flex.html

clint



Re: Advice needed for searching: filters vrs. queries

Clinton Gormley-2
In reply to this post by InquiringMind
On Mon, 2013-03-18 at 12:57 -0700, InquiringMind wrote:
> It would seem that for each indexed term, the matching documents
> (hopefully, the _id references to them) should be sub-ordered by _id.

The main reason I gave up trying to figure out what the code does is
that it was at odds with the comments.  So I opened this issue instead:

https://github.com/elasticsearch/elasticsearch/issues/2798

clint




Re: Advice needed for searching: filters vrs. queries

InquiringMind
In reply to this post by Clinton Gormley-2
Clint,

Thanks for the link and for all your patient help and valuable information. I also found your excellent SlideShare tutorial on searching which tied together the information from your recent posts.

One additional question: Is there a similar situation with filters in that some types of filters expect their filter values to already be analyzed, while others analyze them?

One additional comment: The link you provided discusses the concept of leapfrogging which speeds the intersection of a query and a filter. Now if Lucene could apply this to the BoolQueryBuilder in which all of the must terms (at least) were processed by leapfrogging, then complex queries could be much faster. For example: (city:"New Hartford" AND (state:NY OR state:CT)) 

Just a thought... Probably somewhat naive based on my lack of experience with anything deep inside Lucene.

On Monday, March 18, 2013 4:20:53 PM UTC-4, Clinton Gormley wrote:

<snip/>

This may shed some more light on it:

http://blog.mikemccandless.com/2010/09/fast-search-filters-using-flex.html

clint



Re: Advice needed for searching: filters vrs. queries

Clinton Gormley-2
Hiya
>
> One additional question: Is there a similar situation with filters in
> that some types of filters expect their filter values to already be
> analyzed, while others analyze them?

Filters are not analyzed.
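
A small illustration of what that means in practice, as a sketch in plain Java rather than the Elasticsearch API. It assumes a term-style filter on an analyzed field, and a toy analyzer (lowercase, split on whitespace) that stands in for whatever analyzer the field really uses; all names are hypothetical. The index holds the analyzed terms, while the filter value is looked up exactly as given.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: names are hypothetical, not Elasticsearch APIs.
class UnanalyzedFilterSketch {

    // Assumed analyzer: lowercase, then split on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Inverted index built from analyzed text: term -> sorted doc IDs.
    static Map<String, Set<Integer>> buildIndex(String[] docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : analyze(docs[docId])) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Term-style filter: the value is NOT analyzed, just looked up verbatim.
    static Set<Integer> termFilter(Map<String, Set<Integer>> index, String value) {
        return index.getOrDefault(value, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
                buildIndex(new String[] {"New Hartford", "Hartford"});
        System.out.println(termFilter(index, "New Hartford")); // []
        System.out.println(termFilter(index, "hartford"));     // [0, 1]
    }
}
```

So with a filter you either supply the value in its already-analyzed form, or filter on a field that is indexed without analysis.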

>
>
> One additional comment: The link you provided discusses the concept of
> leapfrogging which speeds the intersection of a query and a filter.
> Now if Lucene could apply this to the BoolQueryBuilder in which all of
> the must terms (at least) were processed by leapfrogging, then complex
> queries could be much faster. For example: (city:"New Hartford" AND
> (state:NY OR state:CT))
>
>
> Just a thought... Probably somewhat naive based on my lack of
> experience with anything deep inside Lucene.

I have no idea :)

That said, I think where we would like to get to is being able to
calculate the cost of individual clauses and run the cheapest clause
first. So, e.g., searching for "to appendiculate" would process the
"appendiculate" clause before the "to" clause.

But we're not there yet.
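
As a rough sketch of the "cheapest clause first" idea, in plain Java and not anything Lucene or Elasticsearch does today: assume each clause is a sorted array of matching doc IDs and that its cost is simply the length of that array. The clauses are treated as a plain AND (phrase positions are ignored), and all names are hypothetical. Starting from the rarest term keeps the candidate set small before the common term is ever consulted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: names are hypothetical, not Lucene/Elasticsearch APIs.
class CheapestClauseFirst {

    // AND together sorted doc-ID arrays, visiting the shortest list first.
    static int[] intersectAll(List<int[]> clauses) {
        if (clauses.isEmpty()) {
            return new int[0];
        }
        List<int[]> ordered = new ArrayList<>(clauses);
        // Assumed cost model: fewer matching docs means a cheaper clause.
        ordered.sort(Comparator.comparingInt((int[] docs) -> docs.length));
        int[] result = ordered.get(0);
        for (int i = 1; i < ordered.size(); i++) {
            result = intersect(result, ordered.get(i));
        }
        return result;
    }

    // Keep only the candidates that also appear in the (sorted) postings.
    static int[] intersect(int[] candidates, int[] postings) {
        int[] out = new int[candidates.length];
        int n = 0;
        for (int doc : candidates) {
            if (Arrays.binarySearch(postings, doc) >= 0) {
                out[n++] = doc;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] to = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; // very common term
        int[] appendiculate = {3, 8};              // rare term
        int[] hits = intersectAll(Arrays.asList(to, appendiculate));
        System.out.println(Arrays.toString(hits)); // [3, 8]
    }
}
```

With this ordering only the two "appendiculate" candidates are ever looked up in the "to" list, instead of walking all ten "to" postings.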

