[Theory] Improving search result relevance?

Zachary Tong
I'm curious about some practical tips to improve search result relevance.  Currently, I'm tokenizing my fields with shingles and performing a simple "text" search on the shingled field.  I've found this gives better results than other things I've tried (combinations of: terms, n-grams, phrase, shingles).  However, search results leave something to be desired.  I imagine there are ways to fix this...I just don't know how.

For example, if I search for "Servo Gear", it will match all documents with either "Servo" or "Gear" and order them roughly based on frequency.  There is some preference to documents that say "Servo Gear" explicitly, but often a document that lists "Gear" four times will rank higher simply because it has the term more frequently.  Ideally, something that matches the phrase would rank higher.

So, how should I attack this problem?  I'm thinking something like this:
  • Analyzers
    • Regular term tokenizer
    • Shingles, but turn off unigrams
  • Search both terms and shingles, but boost shingles so that phrase matches are sorted higher
  • Perhaps search using span_near so that non-exact phrases can be matched too?  Would it be better to do something like a phrase query with slop instead?
Does that make sense?  I understand ES well enough from a technical point of view, but I'm having a hard time implementing more subtle search algorithms that can surface the correct documents.

Thanks!
-Zach
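
A rough sketch of the plan above, for readers who want something concrete. It builds the analysis settings for a shingle analyzer with unigrams turned off, plus a query that searches both fields and boosts the shingle field. The names `bigram_shingles`, `shingle_only`, `title` and `title.shingles` are made up for illustration, the boost of 3.0 is arbitrary, and the mapping that attaches the analyzer to the sub-field is left out because its syntax differs between Elasticsearch versions.

```python
import json


def build_analysis_settings():
    """Analysis settings with a shingle filter that emits only two-word shingles."""
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; the parameters belong to the shingle token filter.
                "bigram_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,  # "turn off unigrams"
                }
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "shingle_only": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "bigram_shingles"],
                }
            },
        }
    }


def build_boosted_query(text, plain_field="title",
                        shingle_field="title.shingles", shingle_boost=3.0):
    """Search the plain terms and the shingles, boosting the shingle matches."""
    return {
        "bool": {
            "should": [
                {"match": {plain_field: {"query": text}}},
                {"match": {shingle_field: {"query": text, "boost": shingle_boost}}},
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_analysis_settings(), indent=2))
    print(json.dumps(build_boosted_query("servo gear"), indent=2))
```

The boost value would need tuning against real data; it only controls how strongly a shingle hit outweighs repeated single-term hits.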

Improving search result relevance?

Rauan Maemirov
Hi, all. My problem is slightly different, but I think it's essentially the same.

I have an index of items and I'm trying to search by title for 'iphone 5'.
The 'iphone 5' items come first, well sorted, followed by all the others: 'iphone 3g', 'iphone 4s', etc.

My problem is that 'Loreal Elseve 5' also shows up in the results, i.e. Elasticsearch returns every entry containing the number 5 (and scores them pretty high). How can I solve this?

I don't want to filter out all numbers at indexing time, because they're very useful when I search for a keyword followed by a number or version.

Re: Improving search result relevance?

Clinton Gormley-2
On Sun, 2013-01-27 at 20:17 -0800, Rauan Maemirov wrote:

> Hi, all. I'm having a little bit different problem, but I guess in
> essence it's the same.
>
>
> I have an index with items and trying to search by title 'iphone 5'.
> I can get well sorted items 'iphone 5' and then all other 'iphone 3g',
> 'iphone 4s', etc.
>
>
> Now my problem is that there's also 'Loreal Elseve 5' in search
> results, i.e. elastic including in search results all entries with
> number 5 (and the score is pretty high). How could I solve it?


You could try setting minimum_should_match to e.g. "60%".

clint
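
A small worked example of what a percentage means here, assuming the behaviour described in the minimum_should_match documentation: the number computed from a positive percentage is rounded down. The helper `required_clauses` is hypothetical and only mirrors that arithmetic; it also assumes the query analyzes to one optional clause per word.

```python
def required_clauses(optional_clauses: int, percent: int) -> int:
    """Optional clauses that must match for a positive percentage (rounded down)."""
    return (optional_clauses * percent) // 100


if __name__ == "__main__":
    # 'iphone 5' analyzes to two terms under the assumption above.
    print(required_clauses(2, 60))   # 1
    print(required_clauses(2, 100))  # 2
    print(required_clauses(5, 60))   # 3
```

Under these assumptions "60%" of a two-term query still requires only one term, so a document that contains just "5" can keep matching; only a value that works out to 2 would require both terms.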

Re: Improving search result relevance?

ppearcy
Regarding exact phrases getting ranked higher specifically: I like to use a phrase-boost technique together with a term-based analyzer. It breaks down like this:
(field:"test search")^PhraseBoostValue OR field:(test search)

Best Regards,
Paul
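
A minimal sketch of building that query string programmatically and wrapping it in a `query_string` query. The function names are invented for illustration, the boost of 5 is arbitrary, and the unquoted part assumes the user text is plain words with no query-syntax characters.

```python
def escape_phrase(text: str) -> str:
    """Escape backslashes and double quotes so the text can sit inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def phrase_boost_query(field: str, text: str, phrase_boost) -> dict:
    """Boosted exact phrase OR the individual terms, as in the pattern above."""
    phrase = escape_phrase(text)
    # Assumption: `text` holds plain words only, so the unquoted group needs no escaping.
    query = f'({field}:"{phrase}")^{phrase_boost} OR {field}:({text})'
    return {"query_string": {"query": query}}


if __name__ == "__main__":
    print(phrase_boost_query("title", "servo gear", 5)["query_string"]["query"])
    # (title:"servo gear")^5 OR title:(servo gear)
```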


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: Improving search result relevance?

Rauan Maemirov
In reply to this post by Clinton Gormley-2
Hi, Clinton.

I tried that, but I still get every occurrence of 5.

Any other suggestions? I already use query_string field boosting like "fields": ["title^2", "tags^2", "description"]
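
For reference, the request described here might look roughly like the sketch below. This is a guess at the shape of the query, not the poster's actual request: the function name is hypothetical, and only the `fields` list and the "60%" value come from the thread.

```python
import json


def build_multi_field_query(text, minimum_should_match="60%"):
    """query_string over boosted fields, with minimum_should_match applied."""
    return {
        "query": {
            "query_string": {
                "query": text,
                # Field boosts as quoted in the post above.
                "fields": ["title^2", "tags^2", "description"],
                "minimum_should_match": minimum_should_match,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_multi_field_query("iphone 5"), indent=2))
```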

Re: [Theory] Improving search result relevance?

simonw-2
In reply to this post by Zachary Tong
Hey,

Shingles are a good start here. I would personally index the shingles in a dedicated field without unigrams and have a secondary field that doesn't use shingles. That way you can boost the shingle field according to your needs. I would also think about using a DisjunctionMaxQuery as the top-level query, and for each sub-query (one on the shingle field and one on the unigram field) use the minimum_should_match syntax to denote when the query should produce a match.

simon
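
One way this suggestion could be sketched in the query DSL is a `dis_max` query with one `match` sub-query per field. The field names `title` and `title.shingles`, the boost and the "75%" threshold are all assumptions chosen for illustration, not values from the thread.

```python
def dis_max_query(text, plain_field="title", shingle_field="title.shingles",
                  shingle_boost=3.0, minimum_should_match="75%"):
    """dis_max over a boosted shingle field and a plain (unigram) field."""
    return {
        "dis_max": {
            "queries": [
                # Sub-query on the dedicated shingle field, boosted.
                {"match": {shingle_field: {
                    "query": text,
                    "boost": shingle_boost,
                    "minimum_should_match": minimum_should_match,
                }}},
                # Sub-query on the field without shingles.
                {"match": {plain_field: {
                    "query": text,
                    "minimum_should_match": minimum_should_match,
                }}},
            ]
        }
    }
```

With `dis_max` the document score comes mainly from the best-matching sub-query rather than the sum of both, which is the point of using it as the top-level query here.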

Re: [Theory] Improving search result relevance?

Alan Woodward
We implemented this sort of phrase-boosting for a client recently by shingling the query string outside elasticsearch and then adding the shingles as phrase queries in SHOULD clauses.  So a search for 'annual leave entitlement' became:

"bool" : {
  "must" : { "query_string" : "annual leave entitlement" },
  "should" : [
    { "text" : { "type" : "phrase", "query" : "annual leave" } },
    { "text" : { "type" : "phrase", "query" : "leave entitlement" } } ] }

Alan Woodward
www.flax.co.uk
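
A minimal sketch of the client-side shingling described above, under a few assumptions: whitespace tokenization, two-word shingles, a hypothetical field named `body` (the snippet above elides the field name), and `match_phrase`, which is the present-day name for a "text" query of type "phrase".

```python
def shingles(text, size=2):
    """Split on whitespace and return every run of `size` adjacent words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def phrase_boosted_bool(text, field="body"):
    """The full query is required; each shingle is an optional phrase that adds score."""
    return {
        "bool": {
            "must": {"query_string": {"query": text, "default_field": field}},
            "should": [{"match_phrase": {field: s}} for s in shingles(text)],
        }
    }


if __name__ == "__main__":
    print(shingles("annual leave entitlement"))
    # ['annual leave', 'leave entitlement']
```

Documents that contain more of the shingles as exact phrases pick up more SHOULD-clause score, so they sort above documents that merely contain the individual words.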


Re: Improving search result relevance?

Rauan Maemirov
In reply to this post by Rauan Maemirov
My question is still open.
What is the most general solution to this?

I tried querying with use_dis_max, but it doesn't change much.
I would try setting a threshold on the score, but every single occurrence of the number 5 in the index has roughly the same score as the most relevant results.
