Term and Has_Child Query Optimization

David Hagar

I'm running large queries (20 term queries and 20 has_child queries) and am looking for ways to reduce the response time, which is currently 8 minutes on 4 million docs. A pure term query takes just a few seconds. At a high level, the has_child queries are for collections that users create. Since collections change, they are stored as child documents. The query is meant to capture things the user "likes", in the form of terms and other users' collections, so I can't require any one item, and I want to rank highly the documents that match a lot of the liked terms and collections. The question is: are there faster alternatives to the method I've chosen? I've included an example.

Numbers
Documents: 4 million
Collection Items: 18 million
Hardware: two AWS m3.xlarge instances, ten shards

Small Example

Mapping

curl -XPUT 'http://localhost:9200/collection-test?pretty=true' -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
   
        "document": {
            "properties": {
                "bodyText": { "type": "string" }
            }
        },
     
        "collection_item": {
            "_parent": { "type": "document" },
            "_all" : {"enabled" : false},
            "properties": {                
                "collection_id": { "type": "integer", "index": "not_analyzed" }
            }
        }
          
    }
}'

Documents

curl -XPUT 'http://localhost:9200/collection-test/document/1' -d '{
    "bodyText" : "Creativity is intelligence having fun - Albert Einstein"
}'

curl -XPUT 'http://localhost:9200/collection-test/document/2' -d '{
    "bodyText" : "Anything one man can imagine, other men can make real. - Jules Verne"
}'

curl -XPUT 'http://localhost:9200/collection-test/document/3' -d '{
    "bodyText" : "Man will become better when you show him what he is like. - Anton Chekhov"
}'


Collections

curl -XPOST 'localhost:9200/collection-test/collection_item/1?parent=1' -d '{ "collection_id" : "1" }'

curl -XPOST 'localhost:9200/collection-test/collection_item/2?parent=1' -d '{ "collection_id" : "2" }'
curl -XPOST 'localhost:9200/collection-test/collection_item/4?parent=2' -d '{ "collection_id" : "2" }'

Multiple Term and Multiple Collection Query

curl -XPOST 'localhost:9200/collection-test/document/_search?pretty=true' -d '{
"query" : {
    "bool" : {       
        "should" : [
            {
                "term" : { "bodyText" : { "value" : "anything", "boost" : 1.0 } }
            },
            {
                "term" : { "bodyText" : { "value" : "man", "boost" : 1.0 }}
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "1" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    "boost": "1.0",
                    "query" : {  
                        "term" : { "collection_id" : "2" }
                    }
                }
            }
        ],
        "minimum_number_should_match" : 1
      }
    }
}'
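
With 20 terms and 20 collections, the should clauses are easier to generate than to write by hand. A minimal Python sketch that assembles the same request body as the small example from a list of weighted terms and a list of collection ids. The helper name build_likes_query is hypothetical, and the body simply mirrors the query DSL used in this thread (has_child, minimum_number_should_match), so it assumes an Elasticsearch release of that era.

```python
import json

def build_likes_query(term_boosts, collection_ids):
    """Build a bool/should body: one term clause per liked term and
    one has_child clause per liked collection (hypothetical helper)."""
    should = []
    for term, boost in term_boosts:
        should.append({"term": {"bodyText": {"value": term, "boost": boost}}})
    for cid in collection_ids:
        should.append({
            "has_child": {
                "type": "collection_item",
                "boost": 1.0,
                "query": {"term": {"collection_id": cid}},
            }
        })
    # Nothing is required; a document only has to match one clause.
    return {"query": {"bool": {"should": should,
                               "minimum_number_should_match": 1}}}

body = build_likes_query([("anything", 1.0), ("man", 1.0)], [1, 2])
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d as in the example above.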

Delete Index

curl -XDELETE 'http://localhost:9200/collection-test/'


Large Query Example

curl -XPOST 'localhost:9200/collection-test/document/_search?pretty=true' -d '{
 "fields" : ["_id", "title","summary"],
"query" : {
    "bool" : {       
        "should" : [
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"harry potter\"^1.0" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"j.k. rowling\"^0.4083824" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"final movie\"^0.40137964" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"fantasy series\"^0.3629825" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"box office records\"^0.35038263" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"breaking dawn\"^0.11963159" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"final installment\"^0.11438772" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"film series\"^0.35038263" }
            },
            {
                "term" : { "bodyText" : { "value" : "potter", "boost" : 0.805837 } }
            },
            {
                "term" : { "bodyText" : { "value" : "deathly", "boost" : 0.46554363 } }
            },
            {
                "term" : { "bodyText" : { "value" : "hallows", "boost" : 0.46430007 }}
            },
            {
                "term" : { "bodyText" : { "value" : "rowling", "boost" : 0.3994508 } }
            },
            {
                "term" : { "bodyText" : { "value" : "j.k.", "boost" : 0.39741242 }}
            },
            {
                "term" : { "bodyText" : { "value" : "pottermore", "boost" : 0.36284378 } }
            },
            {
                "term" : { "bodyText" : { "value" : "dumbledore", "boost" : 0.36096284 }}
            },
            {
                "term" : { "bodyText" : { "value" : "muggles", "boost" : 0.3579579 } }
            },
            {
                "term" : { "bodyText" : { "value" : "harry", "boost" : 0.17482029 }}
            },
            {
                "term" : { "bodyText" : { "value" : "grint", "boost" : 0.12138573 } }
            },
            {
                "term" : { "bodyText" : { "value" : "hogwarts", "boost" : 0.119226046 }}
            },
            {
                "term" : { "bodyText" : { "value" : "blackly", "boost" : 0.11385573 } }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "445" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "529" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "93" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "480" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "341" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "99" }
                    }
                }
            },
            {

                "has_child" : {       
                    "type" : "collection_item",
                   
                    "boost": "1.0",
                    "query" : {  
                        "term" : { "collection_id" : "563" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "34" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                   
                    "boost": "1.0",
                    "query" : {  
                        "term" : { "collection_id" : "347" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                   
                    "boost": "1.0",
                    "query" : {  
                        "term" : { "collection_id" : "355" }
                    }
                }
            },
            {

                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "571" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "95" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "96" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "108" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "435" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "474" }
                    }
                }
            },
            {

                "has_child" : {       
                    "type" : "collection_item",
                   
                    "boost": "1.0",
                    "query" : {  
                        "term" : { "collection_id" : "550" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                    
                    "boost": "1.0",
                    "query" : {
                        "term" : { "collection_id" : "326" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                   
                    "boost": "1.0",
                    "query" : {  
                        "term" : { "collection_id" : "514" }
                    }
                }
            },
            {
                "has_child" : {       
                    "type" : "collection_item",
                   
                    "boost": "1.0",
                    "query" : {  
                        "term" : { "collection_id" : "490" }
                    }
                }
            }

        ],

        "minimum_number_should_match" : 1
      }
    }
}'


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Term and Has_Child Query Optimization

David Hagar
I forgot to ask one question. Would one solution be to put the collection item data in a list field on the document itself, in the same type, paying the penalty of deletes and re-adds on every update but getting the speedup of running everything as term queries? It's more likely that users will add new items to collections, so old items won't change as often, and new docs could be kept in their own index.
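
To make the idea concrete, here is a minimal Python sketch of the denormalized layout, where collection membership lives in a list field on the document and every clause becomes a plain term query. The field name collection_ids and the helper build_flat_query are assumptions made up for illustration; they are not part of the mapping shown earlier.

```python
import json

# Hypothetical denormalized document: the "collection_ids" list field
# replaces the collection_item child documents.
doc = {
    "bodyText": "Anything one man can imagine, other men can make real. - Jules Verne",
    "collection_ids": [2],
}

def build_flat_query(term_boosts, collection_boosts):
    """All clauses are term queries against fields of the parent document."""
    should = [{"term": {"bodyText": {"value": t, "boost": b}}}
              for t, b in term_boosts]
    should += [{"term": {"collection_ids": {"value": c, "boost": b}}}
               for c, b in collection_boosts]
    return {"query": {"bool": {"should": should,
                               "minimum_number_should_match": 1}}}

flat_body = build_flat_query([("anything", 1.0), ("man", 1.0)],
                             [(1, 1.0), (2, 1.0)])
print(json.dumps(flat_body, indent=2))
```

The trade-off is the one described above: adding a document to a collection means reindexing that whole document rather than adding one small child document.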

On Tuesday, 26 February 2013 17:14:28 UTC-6, David Hagar wrote:

> [original post quoted in full]


Re: Term and Has_Child Query Optimization

Clinton Gormley-2
In reply to this post by David Hagar
Hi David

> I'm doing large queries ( 20 terms and 20 has_child queries) and am
> looking for ways to optimize the response time which is currently at 8
> min on 4 million docs. A pure term query is just a few seconds. At a
> high level the has_child query is for collections that users create.
> Since they change they are in a child index. The query is meant to
> capture things the user "likes" in the form of terms and other users
> collections so I can't require any one item and I want to highly rank
> documents that have allot of liked terms and collections. The question
> is are there alternative to the method I've chosen that is faster?
> I've included an example.

Your "large" example is performing a lot of queries, in an inefficient
manner. Use filters whenever possible - filters can be cached, while
queries cannot.

For these term queries, you could rewrite them as a custom_filters_score
query, so that they contribute to scoring, but perform more efficiently,
because they are cached filters:
>             {
>                 "term" : { "bodyText" : { "value" : "potter", "boost" : 0.805837 } }
>             },
>             {
>                 "term" : { "bodyText" : { "value" : "deathly", "boost" : 0.46554363 } }
>             },

eg:

    custom_filters_score: {
        query: { .... your full text queries ... },
        score_mode: "multiply",
        filters: [{
            boost: 0.805837,
            filter: { term: { bodyText: "potter" }}
        },{
            ... etc ...
        }]
    }
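
A rough Python sketch of how the filters list above could be generated from the (term, boost) pairs in the original query. The helper name build_custom_filters_score is hypothetical, the base query is one of the query_string clauses from the original post, and the output follows the custom_filters_score shape shown above rather than any verified API reference.

```python
import json

def build_custom_filters_score(base_query, term_boosts, score_mode="multiply"):
    """Turn each weighted term into a boosted term filter (hypothetical helper)."""
    return {
        "custom_filters_score": {
            "query": base_query,
            "score_mode": score_mode,
            "filters": [
                {"filter": {"term": {"bodyText": term}}, "boost": boost}
                for term, boost in term_boosts
            ],
        }
    }

base = {"query_string": {"default_field": "bodyText",
                         "query": "\"harry potter\"^1.0"}}
cfs = build_custom_filters_score(base, [("potter", 0.805837),
                                        ("deathly", 0.46554363)])
print(json.dumps(cfs, indent=2))
```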


Similarly, use has_child filters instead of queries, and wrap all the
clauses into a single has_child clause:

    { filtered: {
       query: { custom_filters_score: {... query from above ... }},
       filter: {
           has_child: {
               type: "collection_item",
               filter: {
                   terms: { collection_id: [550,490,....]}
               }
           }
       }
    }}
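
As a sketch of the same idea in Python: the twenty separate has_child clauses collapse into one has_child filter with a single terms filter over all the collection ids. The helper names are hypothetical, the child type is taken from the mapping in the original post, and whether has_child accepts a nested filter depends on the Elasticsearch version, so treat the exact shape as an assumption.

```python
import json

def build_has_child_filter(collection_ids, child_type="collection_item"):
    """One has_child filter covering every liked collection (hypothetical helper)."""
    return {
        "has_child": {
            "type": child_type,
            "filter": {"terms": {"collection_id": sorted(collection_ids)}},
        }
    }

def wrap_filtered(query, flt):
    """Restrict a scoring query with a (cacheable) filter."""
    return {"filtered": {"query": query, "filter": flt}}

scoring_query = {"query_string": {"default_field": "bodyText",
                                  "query": "\"harry potter\"^1.0"}}
filtered_body = {"query": wrap_filtered(scoring_query,
                                        build_has_child_filter([550, 490, 93]))}
print(json.dumps(filtered_body, indent=2))
```

Note that this form only answers the yes/no question: the filter restricts the result set and does not contribute to the score.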

I'm not sure of your intention with has_child.  Do you want to check
whether it has children in any of these collections (ie yes/no) or do
you want the document to score higher the more collections it has?

The former is handled by my filter above. The latter you could rewrite
as a custom_filters_score query which is passed to a has_child query:

    { has_child: {
         type: "collection_item",
         query: {
            custom_filters_score: {
               ... the query above ...,
               score_mode: "total",
               filters: [
                 { filter: {term: { collection_id: 550}}, boost: 1},
                 { filter: {term: { collection_id: 490}}, boost: 1},
                 ... etc ...
               ]
            }
         }
     }}


Also, it's curious that you're doing 'term' queries on the bodyText,
because term queries look for exact terms, but your field is analyzed.
So for instance, this clause will never match:

"term" : { "bodyText" : { "value" : "j.k.", "boost" : 0.39741242 }}

The text "J.K." would be indexed as the terms ["j","k"], so there is no
"j.k." term to be found.

Also, the more queries you do, the more work Elasticsearch has to do,
and the longer searches will take.

hth

clint


Re: Term and Has_Child Query Optimization

David Hagar

The goal is to return the documents that contain the most words and are in the most collections, without requiring either. Each of the words and collections also has a relevance weight driven by the user. I think of it as a search vector of weighted terms and collection tags, where the scoring finds the docs that are most similar, with the appropriate weighting for term frequency and collection size. If I were solving this without an index, I would just compute the cosine distance between the search vector and each doc vector, where both have weighted and normalized terms and tags.
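
For reference, the index-free version can be sketched in a few lines of Python. Vectors are sparse dicts mapping a feature (a term or a collection tag) to its weight; the feature names and weights below are invented for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Made-up search vector and document vector mixing terms and collection tags.
search_vec = {"term:potter": 0.8, "term:hogwarts": 0.1, "collection:93": 1.0}
doc_vec = {"term:potter": 0.5, "collection:93": 1.0, "term:quidditch": 0.3}
similarity = cosine_similarity(search_vec, doc_vec)
print(similarity)
```

A document that shares no terms or collections with the search vector scores 0, and no single feature is required for a match.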

My ideal doc score is similar to a dot product:

doc score = search_term_boost_1 * lucene_weighted_freq_of_term_1_in_doc
          + search_term_boost_2 * lucene_weighted_freq_of_term_2_in_doc
          + ...
          + balanceFactor * ( search_collection_boost_1 * (collection_1_has_doc / collection_size_1)
                            + search_collection_boost_2 * (collection_2_has_doc / collection_size_2)
                            + ... )

where collection_x_has_doc is 1 if the doc is in collection x and 0 if not.
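
The same formula written out as a small Python function, as a sanity check on the arithmetic. The function name and all numbers are made up; in particular doc_term_weights stands in for lucene_weighted_freq_of_term_x_in_doc and is passed in as plain numbers here, not computed the way Lucene would.

```python
def ideal_doc_score(term_boosts, doc_term_weights, collection_boosts,
                    doc_collections, collection_sizes, balance_factor=1.0):
    """Dot-product style score from the formula above (illustrative only)."""
    # Sum of boost * weighted frequency over the liked terms.
    term_part = sum(boost * doc_term_weights.get(term, 0.0)
                    for term, boost in term_boosts.items())
    # collection_x_has_doc is 1 only for collections that contain the doc,
    # so other collections contribute nothing.
    collection_part = sum(boost / collection_sizes[cid]
                          for cid, boost in collection_boosts.items()
                          if cid in doc_collections)
    return term_part + balance_factor * collection_part

score = ideal_doc_score(
    term_boosts={"potter": 0.8, "harry": 0.2},
    doc_term_weights={"potter": 0.5},        # doc contains "potter" only
    collection_boosts={93: 1.0, 480: 1.0},
    doc_collections={93},                    # doc is in collection 93 only
    collection_sizes={93: 4, 480: 10},
    balance_factor=2.0,
)
print(score)
```

Dividing by the collection size means membership in a small collection counts for more than membership in a large one, which is the normalization described above.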

Doesn't custom_filters_score require every result to pass at least one of the filters? I need a result to be able to match on words alone if that's all there is. Also, in your examples of switching to filters, the filter restricts the result of a query. In my case I don't have a root query to restrict, just a bunch of doc attributes (terms and collection memberships) that are representative of what I'm looking for but are not required.

Thanks for spotting the "j.k." tokenization issue. My simplification of our schema left out our custom tokenizer, which sees it as a single token.

I've tried a single has_child with a sum score_type and the performance is the same. Here is the example.

curl -XPOST 'localhost:9200/neuron/document/_search?pretty=true' -d '{
 "fields" : ["_id", "title","summary"],
"query" : {
    "bool" : {      
        "should" : [
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"harry potter\"^1.0" }
            },
            {
            "query_string" : { "default_field" : "bodyText", "query" : "\"j.k. rowling\"^0.4083824" }
            },
           …           

           {

                "has_child" : {      
                    "type" : "collection_item",
                    "boost": "1.0",
                    "score_type" : "sum",
                    "query" : {
                      "bool" : {      
                          "should" : [
                                { "term" : { "collection_id" : { "value" : "93", "boost" : 1.0 }}},
                                { "term" : { "collection_id" : { "value" : "480", "boost" : 1.0 }}},                                 
                                ...                            

                                { "term" : { "collection_id" : { "value" : "529", "boost" : 1.0 }}}                                 
                             ],


                        "minimum_number_should_match" : 1
                       }
                    }
                }
            }
        ],
        "minimum_number_should_match" : 1
      }
    }
}'


This is my third question here and you've answered all three, which is much appreciated. :)



On Wednesday, 27 February 2013 04:37:36 UTC-6, Clinton Gormley wrote:
Hi David

> I'm doing large queries ( 20 terms and 20 has_child queries) and am
> looking for ways to optimize the response time which is currently at 8
> min on 4 million docs. A pure term query is just a few seconds. At a
> high level the has_child query is for collections that users create.
> Since they change they are in a child index. The query is meant to
> capture things the user "likes" in the form of terms and other users
> collections so I can't require any one item and I want to highly rank
> documents that have allot of liked terms and collections. The question
> is are there alternative to the method I've chosen that is faster?
> I've included an example.

Your "large" example is performing a lot of queries, in an inefficient
manner. Use filters whenever possible - filters can be cached, while
queries cannot.

For these term queries, you could rewrite them as a custom_filters_score
query, so that they contribute to scoring, but perform more efficiently,
because they are cached filters:
>             {
>                 "term" : { "bodyText" : { "value" : "potter", "boost" : 0.805837 } }
>             },
>             {
>                 "term" : { "bodyText" : { "value" : "deathly", "boost" : 0.46554363 } }
>             },

eg:

    custom_filters_score: {
        query: { .... your full text queries ... },
        score_mode: "multiply",
        filters: [{
            boost: 0.805837,
            filter: { term: { bodyText: "potter" }}
        },{
            ... etc ...
        }]
    }


Similarly, use has_child filters instead of queries, and wrap all the
clauses into a single has_child clause:

    { filtered: {
       query: { custom_filters_score: {... query from above ... }},
       filter: {
           has_child: {
               type: "collection_item",
               filter: {
                   terms: { collection_id: [550,490,....]}
               }
           }
       }
    }}

I'm not sure of your intention with has_child.  Do you want to check
whether it has children in any of these collections (ie yes/no) or do
you want the document to score higher the more collections it has?

The former is handled by my filter above. The latter you could rewrite
as a custom_filters_score query which is passed to a has_child query:

    { has_child: {
         type: "collection_item",
         score_type: "sum",
         query: {
            custom_filters_score: {
               query: { ... query on the child docs ... },
               score_mode: "total",
               filters: [
                 { filter: {term: { collection_id: 550}}, boost: 1},
                 { filter: {term: { collection_id: 490}}, boost: 1},
                 etc
               ]
            }
         }
     }}


Also, it's curious that you're doing 'term' queries on the bodyText,
because term queries look for exact terms, but your field is analyzed.
So for instance, this clause will never match:

"term" : { "bodyText" : { "value" : "j.k.", "boost" : 0.39741242 }}

The text "J.K." would be indexed as the terms ["j","k"], so there is no
"j.k." term to be found.

Also, the more queries you do, the more work Elasticsearch has to do,
and the longer searches will take.

hth

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Term and Has_Child Query Optimization

Clinton Gormley-2
Hiya

> In the use of custom_filters_score, doesn't it require all results to
> pass at least one of the filters? I need to have a result just match
> words if thats all there is.

No it doesn't. The custom_filters_score will tweak the score IF a filter
matches, but no match is required.

> Also in your examples of switching to filters, the filter restricts
> the result of a query. In my case I don't have a root query to be
> restricted, just a bunch of doc attributes (terms and found in
> collections) that are a representative of what I'm looking for but are
> not required.

Well, your root query is the bool query containing the full text queries
(eg "harry potter" etc).

So the idea is to use queries just for the full text part, and filters
for everything else.  A filter either matches or it doesn't, so you can
use that to apply a boost or not.  It won't take TF/IDF into account at
all, which I *think* (not having understood the dot product bit :) would
serve your purposes.
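
A rough Python sketch of that split is below. It builds a request body in which the full text phrases form the root query and each collection becomes a has_child filter inside custom_filters_score. The helper name `build_search_body` and the sample boosts are invented for illustration, and the exact option names should be checked against the Elasticsearch version in use.

```python
import json


def build_search_body(phrases, collection_boosts, child_type="collection_item"):
    """Build a search body: queries for full text, filters for collections.

    phrases:           list of (phrase, boost) pairs searched in bodyText
    collection_boosts: list of (collection_id, boost) pairs
    """
    # Full text part: scored queries, at least one of which must match.
    text_query = {
        "bool": {
            "should": [
                {"query_string": {"default_field": "bodyText",
                                  "query": '"%s"^%s' % (phrase, boost)}}
                for phrase, boost in phrases
            ],
            "minimum_number_should_match": 1,
        }
    }
    # Collection part: one yes/no filter per collection, each carrying a boost.
    filters = [
        {
            "filter": {"has_child": {"type": child_type,
                                     "query": {"term": {"collection_id": cid}}}},
            "boost": boost,
        }
        for cid, boost in collection_boosts
    ]
    return {"query": {"custom_filters_score": {
        "query": text_query,
        "filters": filters,
        "score_mode": "total",
    }}}


body = build_search_body([("harry potter", 1.0), ("j.k. rowling", 0.4083824)],
                         [(93, 2.0), (480, 2.0)])
request_json = json.dumps(body, indent=2)   # what would be sent with curl -d
```

Note that in this shape only documents matching at least one text phrase come back; a document that sits in a liked collection but matches none of the phrases would not be returned.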


> I've tried this single has_child with a sum score_type and the
> performance is the same. Here is the example.

OK - wasn't sure if using multiple has_child's was having a big impact,
but it looks like it is just the number of queries.  The more you can
use filters the better (assuming your filters will be reused in
subsequent searches -- filters are slightly faster anyway, but their
major contribution to performance is through caching).


> This is my third question here and you've answered all three and its
> much appreciated. :)

Good questions are always welcome :)

clint



Re: Term and Has_Child Query Optimization

David Hagar

My problem may seem complicated, but it's actually simple. I want to query with lots of terms and collection tags/ids on the document that behave, scoring-wise, like plain words, without requiring any of them. It's complicated only because my collections change.

What do you think of moving the child index collection data into a document field and paying the price of updating/reindexing the whole doc whenever a doc is added to a collection? Running lots of plain term searches works well. Have you seen numbers like X updates per hour on an index of size Y reduces performance by Z percent?
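
To sketch what I mean, assuming a hypothetical `collection_ids` field on the document itself (it is not in the mapping I posted), the collection clauses become plain term clauses next to the text clauses, and adding a doc to a collection means rewriting the whole doc. Both helper names are made up for illustration.

```python
def build_denormalized_query(phrases, collection_boosts, field="collection_ids"):
    """One flat bool query: text phrases and collection ids are all optional clauses."""
    should = [
        {"query_string": {"default_field": "bodyText",
                          "query": '"%s"^%s' % (phrase, boost)}}
        for phrase, boost in phrases
    ]
    should += [
        {"term": {field: {"value": cid, "boost": boost}}}
        for cid, boost in collection_boosts
    ]
    return {"query": {"bool": {"should": should,
                               "minimum_number_should_match": 1}}}


def add_to_collection(doc_source, collection_id, field="collection_ids"):
    """Return a copy of the doc source with the id added; the caller reindexes the whole doc."""
    updated = dict(doc_source)
    ids = set(updated.get(field, []))
    ids.add(collection_id)
    updated[field] = sorted(ids)
    return updated


query = build_denormalized_query([("harry potter", 1.0)], [(93, 1.0), (480, 1.0)])
doc = {"bodyText": "Anything one man can imagine, other men can make real."}
doc_v2 = add_to_collection(doc, 93)
doc_v3 = add_to_collection(doc_v2, 93)   # adding the same id twice changes nothing
```

Nothing is required here, so a doc that is only in a liked collection still comes back, which is the behaviour I'm after.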

Also, from what I'm reading, custom_filters_score overrides the query score for whatever it's filtering. If I have words A, B, and C and collections 1, 2, and 3, and the only overlap is a single doc shared by B and 2, that one doc should score the highest, but all the docs matching any of A, B, C, 1, 2, or 3 should be in the results.



In reply to Clinton Gormley's message of Friday, 1 March 2013 06:35:51 UTC-6.


Re: Term and Has_Child Query Optimization

Clinton Gormley-2

> What do you think of moving the child index collection data into a
> document field and pay the price of updating/reindexing the whole doc
> whenever a doc is added to a collection?

That's certainly an option, and it would improve performance. By how
much, I couldn't say.

>  Running lots of plain term searches works well.  Have you seen
> numbers like X updates per hour on a index of size Y reduces
> performance Z percent?

Elasticsearch handles continual updating of data very well. But again,
exact numbers are hard to pin down. It'd be a case of try it and see.

>
> Also custom_filters_score, from what I'm reading overrides the query
> score for what its filtering.

No, it's COMBINED with the query score.  The filters are used to tweak
the _score returned by the query. How the filters affect the score can
be controlled with the score_mode parameter.
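
As a simplified model of that combination (not the actual Elasticsearch code), the Python sketch below assumes the boosts of the matching filters are merged according to score_mode and the result multiplies the query's _score, with the score left untouched when no filter matches. The function names are invented for illustration.

```python
import operator
from functools import reduce


def combined_boost(matching_boosts, score_mode="first"):
    """Merge the boosts of the filters that matched a document."""
    if not matching_boosts:
        return 1.0  # no filter matched: leave the query score as it is
    modes = {
        "first": lambda boosts: boosts[0],
        "min": min,
        "max": max,
        "total": sum,
        "avg": lambda boosts: sum(boosts) / len(boosts),
        "multiply": lambda boosts: reduce(operator.mul, boosts, 1.0),
    }
    return modes[score_mode](matching_boosts)


def final_score(query_score, matching_boosts, score_mode="first"):
    """Assumed model: the query score is scaled by the merged boost, never replaced."""
    return query_score * combined_boost(matching_boosts, score_mode)


no_match = final_score(2.0, [], "total")             # query score passes through
two_total = final_score(2.0, [1.5, 2.0], "total")    # 2.0 * (1.5 + 2.0)
two_multiply = final_score(2.0, [1.5, 2.0], "multiply")
two_first = final_score(2.0, [1.5, 2.0], "first")
```

Under this model a boost of 1 leaves the score unchanged, so boosts need to be above 1 to push matching documents up.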

clint
