Scoring based on the number of matches in the field

Andre Dantas Rocha
Hi there,

I have the following query:

"query": {
  "multi_match": {
    "operator": "and",
    "type": "cross_fields",
    "query": "john smith",
    "fields": ["name", "address"]
  }
}

That will match these documents:

Name: James Smith
Address: 325 John Street

Name: John Smith Junior
Address: 100 Baryl Street

Is there a way to give the last document a higher score, since both terms ("john" and "smith") match in the same field?

Note that this behavior is slightly different from using match_phrase with slop: the query can still match terms in any of the fields, but it should score higher when more terms match in the same field.

Thanks,

Andre

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cc76f51b-3721-4978-a3ed-e59ff4c8f138%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Scoring based on the number of matches in the field

Doug Turnbull
First, note that in Lucene's default similarity there are already two biases towards matches with fewer fields. Try to take advantage of those before going on a boosting expedition.

1. Each term tends to get converted into a boolean SHOULD clause. Every SHOULD clause match gets added to the score. So the fewer matches, the lower the score.

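2. For an even stronger bias, Lucene adds *coord*, the coordination factor. If only 1 out of 3 search terms matches the field being searched, a multiplier of 1/3 is applied, punishing the score. So documents where more terms match should have a much higher chance of winning.

If you want to know more, read Lucene's javadocs on similarity: https://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html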
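
As a rough illustration of the coord factor described above, here is a minimal Python sketch. It assumes coord is simply the number of query terms that match divided by the total number of query terms. The function name `coord` and the lowercase/whitespace tokenization are assumptions made for illustration; this is not Lucene's actual code.

```python
def coord(query_terms, field_text):
    """Toy coord: fraction of the query terms that occur in one field."""
    tokens = set(field_text.lower().split())
    matched = sum(1 for term in query_terms if term.lower() in tokens)
    return matched / len(query_terms)


query = ["john", "smith"]
print(coord(query, "James Smith"))        # 0.5 (only "smith" matches)
print(coord(query, "John Smith Junior"))  # 1.0 (both terms match)
```

When each field is scored on its own, the name field of the second document gets the full multiplier, while the name field of the first document is cut in half.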


"Huh," you're thinking, "why doesn't my scenario just work then?" What you're doing is a *cross_fields* search. Cross-field search is something new to Elasticsearch whereby both fields are blended together and treated like a single field. So the biasing above applies to the two fields together. If you want to know more about cross-field search, here's an article I recently wrote: http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/

If you actually want a bias towards a field with more matches, I'd recommend a best_fields or most_fields search. These take both search terms to each field first, performing a separate search in each field. The per-field scores are then combined (by taking the max score for best_fields, or by adding them for most_fields).
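
For illustration, the original query switched to most_fields might look like the sketch below, built as a Python dict and printed as JSON. Only the `type` differs from the query in the first post; the helper name `multi_match_body` is hypothetical. The "and" operator is left out here on purpose, because with best_fields and most_fields it is applied to each field separately rather than across fields.

```python
import json


def multi_match_body(query_text, fields, match_type):
    """Build a multi_match request body (hypothetical helper, for illustration)."""
    return {
        "query": {
            "multi_match": {
                "type": match_type,
                "query": query_text,
                "fields": fields,
            }
        }
    }


body = multi_match_body("john smith", ["name", "address"], "most_fields")
print(json.dumps(body, indent=2))
```

Passing "best_fields" instead of "most_fields" gives the max-score variant.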

Until I finish the related chapter in the search relevance book I'm writing (shameless plug :-p http://manning.com/turnbull), the best places to read about these topics are the docs or the online guide. In particular, this appears relevant: http://www.elastic.co/guide/en/elasticsearch/guide/master/multi-field-search.html

Hope that helps


-- 
Doug Turnbull Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com 
Author: Taming Search from Manning Publications
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.

Re: Scoring based on the number of matches in the field

Doug Turnbull
Sorry for the confusing typo: in "towards matches with fewer fields", "fields" should be "search terms".

Re: Scoring based on the number of matches in the field

Andre Dantas Rocha
Hi Doug,

Thank you for your quick response and comprehensive explanation. It does make sense.

We are using cross_fields (with the "and" operator) because we want to make sure that the documents returned contain all the search terms somewhere.

For example, a search for "100 john smith" would return only one document ("john smith" matches the name and "100" matches the address).

We expect no results for "200 john smith", as "200" appears nowhere.

But if we search for "john smith", we should get both documents back, and the document with "john smith" in the name should be the first one in the list (since the terms "john" and "smith" match in the same field).
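
To make the expected matching behavior concrete, here is a toy Python model of "every term must appear somewhere in the document". It is only a sketch: the lowercase/whitespace tokenization and the names `docs`, `matches_all_terms` and `search` are assumptions for illustration, and it models matching only, not Elasticsearch scoring or ranking.

```python
docs = [
    {"name": "James Smith", "address": "325 John Street"},
    {"name": "John Smith Junior", "address": "100 Baryl Street"},
]


def matches_all_terms(doc, query_text):
    """True if every query term occurs in at least one field of the document."""
    tokens = set()
    for value in doc.values():
        tokens.update(value.lower().split())
    return all(term in tokens for term in query_text.lower().split())


def search(query_text):
    """Return the names of all documents that contain every query term."""
    return [doc["name"] for doc in docs if matches_all_terms(doc, query_text)]


print(search("100 john smith"))  # ['John Smith Junior']
print(search("200 john smith"))  # []
print(search("john smith"))      # ['James Smith', 'John Smith Junior']
```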

Is it possible to accomplish this with best_fields or most_fields?

Thanks again,

Andre

Re: Scoring based on the number of matches in the field

Doug Turnbull
If you want this to be a minimum requirement, you could:

- Use your cross_fields query as a filter. Specifically, a query filter: http://www.elastic.co/guide/en/elasticsearch/reference/1.5/query-dsl-query-filter.html
- Use a most_fields or best_fields query as your main search query

This would eliminate any search results without a match for every term somewhere, cutting off the long tail as you need to, but score using most_fields.
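
A minimal sketch of the combination described above, built as a Python dict and printed as JSON. It assumes the 1.x syntax from the linked reference page, where a `filtered` query takes a scoring `query` and a non-scoring `filter`, and a `query` filter wraps an ordinary query. The helper name `filtered_search` is hypothetical. Later Elasticsearch versions replaced `filtered` with a `bool` query that has a `filter` clause, so the outer structure would need adapting there.

```python
import json


def filtered_search(query_text, fields):
    """cross_fields AND as a non-scoring filter, most_fields for scoring (sketch)."""
    return {
        "query": {
            "filtered": {
                # Scoring part: each field is searched separately, scores are added.
                "query": {
                    "multi_match": {
                        "type": "most_fields",
                        "query": query_text,
                        "fields": fields,
                    }
                },
                # Filtering part: the original cross_fields query, wrapped in a
                # query filter so it only decides which documents are returned.
                "filter": {
                    "query": {
                        "multi_match": {
                            "operator": "and",
                            "type": "cross_fields",
                            "query": query_text,
                            "fields": fields,
                        }
                    }
                },
            }
        }
    }


body = filtered_search("john smith", ["name", "address"])
print(json.dumps(body, indent=2))
```

The filter keeps the "all terms must appear somewhere" guarantee from the original query, while the most_fields query decides the ordering of whatever passes the filter.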

Does that make sense?

Re: Scoring based on the number of matches in the field

Andre Dantas Rocha
Hi Doug,

Yes, it does make sense. I'll try to rewrite it and get back to you.

Thank you again for your help,
Andre

Re: Scoring based on the number of matches in the field

Andre Dantas Rocha
Hi Doug,

Your suggestion worked perfectly!

Thank you very much.

Andre
