Fine-tuning search

Clinton Gormley
Hiya

I've indexed my docs with two 'keyword' fields:
 - name (with index boost of 1.3)
 - text

Also, the 'all' field is enabled.

Some of the docs have the name field filled in (in which case, this is
the most important field) and others not, in which case we just have the
keywords in the text field.

Here are four example results for a search on 'john smith', in
descending order of desired relevance:
--------------------------------------------------

1) Name:  John Smith
   Text:  John Smith passed away peacefully on March 20, aged 82.
          Funeral service will be held on Tuesday, April 3 in ....

2) Name:  John Smith
   Text:  Passed away peacefully on March 20, aged 82. Funeral service
          will be held on Tuesday, April 3 in ....

3) Name:  ''
   Text:  John Smith passed away peacefully on March 20, aged 82.
          Funeral service will be held on Tuesday, April 3 in ....

4) Name:  Maggie Smith
   Text:  Maggie Smith passed away peacefully on March 20, aged 82.
          Sadly missed by husband John

A naive search for 'john smith' on the 'all' field favours doc (4) over
doc (3).

I'm trying to apply this logic, in descending order of importance:
 - all the words close together in the name field
 - all the words close together in the text field, if the doc
   doesn't have a name field
 - as many words as possible in the 'all' field

Does this query achieve that? Any way of improving it?

curl -XGET 'http://127.0.0.1:9200/ia_object/notice/_search?searchType=dfs_query_then_fetch'  -d '
{
   "sort" : [
      "score"
   ],
   "fields" : [],
   "query" : {
      "filteredQuery" : {
         "filter" : {
            "bool" : {
               "must" : [
                  {
                     "term" : {
                        "status" : "active"
                     }
                  },
                  {
                     "term" : {
                        "location_id" : "23"
                     }
                  }
               ]
            }
         },
         "query" : {
            "disMax" : {
               "tieBreaker" : "0.7",
               "queries" : [
                  {
                     "queryString" : {
                        "fields" : [
                           "name"
                        ],
                        "boost" : "1.3",
                        "query" : "\"john smith\"~4"
                     }
                  },
                  {
                     "filteredQuery" : {
                        "filter" : {
                           "term" : {
                              "has_name" : "0"
                           }
                        },
                        "query" : {
                           "queryString" : {
                              "fields" : [
                                 "text"
                              ],
                              "boost" : "1.5",
                              "query" : "\"john smith\"~4"
                           }
                        }
                     }
                  },
                  {
                     "queryString" : {
                        "boost" : 1,
                        "query" : "john smith"
                     }
                  }
               ]
            }
         }
      }
   },
   "from" : 0,
   "size" : "100"
}
'
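
For anyone who wants to generate this body from user input rather than hand-write it, here is a minimal Python sketch. The function name build_notice_query and its parameters are made up for illustration; the key names, boosts and filters are simply copied from the query above, so this is only as correct as that query is.

```python
import json


def build_notice_query(keywords, slop=4, tie_breaker=0.7):
    """Build the search body shown above for an arbitrary keyword string.

    Hypothetical helper: only the JSON key names come from the original query.
    """
    # Escape embedded double quotes so the phrase stays one quoted unit.
    phrase = '"%s"~%d' % (keywords.replace('"', '\\"'), slop)
    return {
        "sort": ["score"],
        "fields": [],
        "query": {
            "filteredQuery": {
                "filter": {"bool": {"must": [
                    {"term": {"status": "active"}},
                    {"term": {"location_id": "23"}},
                ]}},
                "query": {"disMax": {
                    "tieBreaker": tie_breaker,
                    "queries": [
                        # 1. all the words close together in the name field
                        {"queryString": {"fields": ["name"],
                                         "boost": 1.3,
                                         "query": phrase}},
                        # 2. the same phrase in text, only for docs without a name
                        {"filteredQuery": {
                            "filter": {"term": {"has_name": "0"}},
                            "query": {"queryString": {"fields": ["text"],
                                                      "boost": 1.5,
                                                      "query": phrase}},
                        }},
                        # 3. as many words as possible in the 'all' field
                        {"queryString": {"boost": 1, "query": keywords}},
                    ],
                }},
            }
        },
        "from": 0,
        "size": 100,
    }


body = build_notice_query("john smith")
print(json.dumps(body, indent=3))
```

Keeping slop and tie_breaker as arguments makes it easier to experiment with different values without editing the JSON by hand.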

thanks

Clint

--
Web Announcements Limited is a company registered in England and Wales,
with company number 05608868, with registered address at 10 Arvon Road,
London, N5 1PR.

Re: Fine-tuning search

Clinton Gormley
I've implemented the query mentioned in my previous email, but it could
do with some improvement.

For example, when I search on 'Edward Rowe', I get these two results
first, which look correct:

        Alfred Edward Rowe South Shields: Obituary
       
        Alfred Edward Rowe South Shields. Passed away on August 19.
        2009, aged 75. For our kind, gentle and always loving father.
        From your girls Debra, Janet and Carole. Respected father in law
        of Frank,...
       
       
        Obituary
       
        ROWE Albert Edward Eveleigh Passed away suddenly on February 7,
        2007. Funeral service at Portchester Crematorium on Wednesday,
        February 28, 2007, at 1.00 p.m. No flowers please. Donations,
        if...
       
However, if I search for 'Edward Rowe Crematorium' (for which the second
doc is more relevant), the second doc is bumped down the list (i.e. it is
scored as less relevant).

Is this just a question of getting the boosts right, or are my queries
not achieving what I hope?

You can see the live search here (at least, for now):
http://es.iannounce.co.uk/?keywords=edward+rowe+crematorium&sub_type=&date_limit=&region=23&source=0&_fstatus=search

thanks

Clint

Re: Fine-tuning search

kimchy
Administrator
Hi,

  Two things first. If you get the latest master, filteredQuery has been renamed to filtered. Also, there is a field query, which is the same as queryString restricted to a single field (it should make things a bit more readable).
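
If the query body is built in code, the rename can be applied mechanically. A minimal Python sketch, assuming the body is held as nested dicts and lists; rename_keys is a hypothetical helper, and the sample body is a cut-down version of the query from the first post:

```python
def rename_keys(node, renames):
    """Return a copy of a nested query body with dict keys renamed."""
    if isinstance(node, dict):
        return {renames.get(key, key): rename_keys(value, renames)
                for key, value in node.items()}
    if isinstance(node, list):
        return [rename_keys(item, renames) for item in node]
    return node  # strings, numbers and booleans are left untouched


# Cut-down version of the original query, using the old key name.
old_body = {
    "query": {
        "filteredQuery": {
            "filter": {"term": {"status": "active"}},
            "query": {
                "filteredQuery": {
                    "filter": {"term": {"has_name": "0"}},
                    "query": {"queryString": {"fields": ["text"],
                                              "query": "\"john smith\"~4"}},
                }
            },
        }
    }
}

new_body = rename_keys(old_body, {"filteredQuery": "filtered"})
print(new_body)
```

Note that this renames every matching key, so it would also rename a document field that happened to be called filteredQuery; for the query above that is not an issue.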

  I would actually play more with the tieBreaker: try giving it lower values and see if it helps. In general, that's the kind of tweaking you do with Lucene to try and nail the perfect matching. I would also play with adding another query, a phrase query, with a slop of 3 or 4.
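
To see why the tieBreaker matters, recall how Lucene's DisjunctionMaxQuery combines its clauses: the best clause score, plus the tie breaker times the sum of the other clause scores. A small Python sketch of that arithmetic; the per-clause scores below are invented for illustration and are not taken from the real index:

```python
def dis_max_score(sub_scores, tie_breaker):
    """Best sub-score plus tie_breaker times the sum of the remaining ones."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    return best + tie_breaker * (sum(sub_scores) - best)


# Hypothetical clause scores: [name phrase, text phrase, 'all' field]
doc_a = [2.0, 0.0, 0.5]  # one strong phrase match in the name field
doc_b = [1.5, 0.0, 1.6]  # two middling matches

for tie_breaker in (0.7, 0.1):
    print(tie_breaker,
          round(dis_max_score(doc_a, tie_breaker), 2),
          round(dis_max_score(doc_b, tie_breaker), 2))
```

With a tie breaker of 0.7 the doc with two middling matches comes out ahead, while at 0.1 the doc with the single strong match wins, which is the effect a lower value is meant to have.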

-shay.banon

Re: Fine-tuning search

egaumer
In reply to this post by Clinton Gormley
On Wed, Mar 24, 2010 at 2:00 PM, Clinton Gormley <[hidden email]> wrote:
However, if I search for 'Edward Rowe Crematorium' (for which the second
doc is more relevant), the second doc is bumped down the list (i.e. it is
scored as less relevant).

Is this just a question of getting the boosts right, or are my queries
not achieving what I hope?

Essentially what you want is to boost documents where these words appear closer together. At the same time you want matches in the name field to be boosted above matches in the text field.

I would think something along these lines would suffice:


    "queryString" : {
        "fields" : ["name^5", "text"],
        "query" : "edward rowe crematorium",
        "phraseSlop" : "15",
        "useDisMax" : true
    }

I think your slop factor is too low (keep in mind this represents the max allowable distance). Bump it up a bit because the closer your terms, the higher the score (regardless of the max slop factor). In the link you provided, 4 is too low to generate a valid match.
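
To make the "closer terms score higher" point concrete: in Lucene's default similarity, a sloppy phrase match at distance d contributes 1 / (d + 1) to the phrase frequency, and the tf factor is the square root of that frequency. The Python sketch below mirrors just that part of the scoring; the match distances are hypothetical and the rest of the score (idf, norms, boosts) is left out:

```python
import math


def sloppy_freq(distance):
    """Weight of one sloppy phrase match, as in Lucene's default similarity."""
    return 1.0 / (distance + 1)


def phrase_tf(match_distances, slop):
    """Term-frequency factor for a sloppy phrase in one field.

    match_distances: how far each occurrence is from an exact phrase
    (0 = exact). Occurrences further away than `slop` do not match at all.
    """
    freq = sum(sloppy_freq(d) for d in match_distances if d <= slop)
    return math.sqrt(freq)  # the default tf() is the square root of the frequency


# Hypothetical distances, for illustration only.
exact_doc = [0]  # terms adjacent and in order
loose_doc = [9]  # terms spread across a sentence

for slop in (4, 15):
    print(slop, phrase_tf(exact_doc, slop), phrase_tf(loose_doc, slop))
```

With a slop of 4 the loose doc does not match the phrase at all; with a slop of 15 it matches, but with a tf factor of roughly 0.32 against 1.0 for the exact match. So raising the max slop admits more docs without flattening the proximity ranking.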

The example above should give higher scores to documents where all these terms appear closer together and, at the same time, boost documents that have matches in the name field.

I haven't studied the elasticsearch DSL in depth, but this is the general logic you'd use to tune relevancy regardless of the query language you're using.

Start with weighting based on proximity (i.e. the closer the terms the higher the score), then boost specific fields that are more relevant (referred to as "context" weighting).

With Lucene, proximity is achieved via sloppy phrases and context weight is achieved via DisjunctionMaxQuery.

Regards,
-Eric

Re: Fine-tuning search

kimchy
Administrator
I suggested a low slop value because I was matching against the name field, where long names are unlikely. Note that a queryString of "something else something" is not translated by Lucene into a phrase query, but into a boolean OR/AND query (depending on the defaultOperator). To get a phrase you need to write "\"something else something\"", but then you lose the other options of the query parser, so you are probably better off using the phrase query directly if phrase queries are what you want. Still, your idea is a good guideline for what to try to achieve.
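
To illustrate the difference, here is a small Python sketch that builds both forms of the queryString. The helper as_phrase is hypothetical; the quoting and ~slop syntax are the same as in the query from the first post:

```python
import json


def as_phrase(text, slop=None):
    """Wrap text in double quotes so the query parser treats it as a phrase."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    phrase = '"%s"' % escaped
    if slop is not None:
        phrase += "~%d" % slop  # sloppy phrase, e.g. "john smith"~4
    return phrase


# Parsed as a boolean query over the terms (OR/AND per defaultOperator).
boolean_form = {"queryString": {"fields": ["name"], "query": "john smith"}}

# Parsed as a (sloppy) phrase query because of the embedded quotes.
phrase_form = {"queryString": {"fields": ["name"],
                               "query": as_phrase("john smith", slop=4)}}

print(json.dumps(boolean_form))
print(json.dumps(phrase_form))
```

The second line of output shows the escaped quotes (\"john smith\"~4) that need to end up in the JSON body for the phrase form.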

-shay.banon
