Partial word match with singular and plurals: Elasticsearch

Kruti_Shukla
My final goal is to have the following search precedence:
1. Exact phrase match
2. Exact word match with incremental distance
3. Plurals
4. Substring

Suppose I have the following documents:
i. men’s shaver
ii. men’s shavers
iii. men’s foil shaver
iv. men’s foils shaver
v. men’s foil shavers
vi. men’s foils shavers

Case 1: search for : “men’s foil shaver”
Expected result:
1. men’s foil shaver <------ exact phrase match
2. men’s foil shavers <------ exact word match on 2 of 3 words with 0 word distance + plural
3. men’s foils shaver <------ exact word match on 2 of 3 words with 1 word distance + plural
4. men’s foils shavers <------ exact word match on 1 of 3 words + 2 plurals
5. men’s shaver <------ exact word match on 2 of 3 words (66% match)
6. men’s shavers <------ exact word match on 1 of 3 words + plural (66% match)

Case 2: search for : “men’s foil shavers”
Expected result:
1. men’s foil shavers <------ exact phrase match
2. men’s foil shaver <------ exact word match on 2 of 3 words with 0 word distance + singular
3. men’s foils shavers <------ exact word match on 2 of 3 words with 1 word distance + singular
4. men’s foils shaver <------ exact word match on 1 of 3 words + 2 singulars
5. men’s shavers <------ exact word match on 2 of 3 words (66% match)
6. men’s shaver <------ exact word match on 1 of 3 words + singular (66% match)


Case 3: search for : “men’s foils shavers”
Expected result:
1. men’s foils shavers <------ exact phrase match
2. men’s foils shaver <------ exact word match on 2 of 3 words with 0 word distance + singular
3. men’s foil shavers <------ exact word match on 2 of 3 words with 1 word distance + singular
4. men’s foil shaver <------ exact word match on 1 of 3 words + 2 singulars
5. men’s shavers <------ exact word match on 2 of 3 words (66% match)
6. men’s shaver <------ exact word match on 1 of 3 words + singular (66% match)


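Is there any way to achieve this in Elasticsearch?
This question is related to my other question, which has not been answered yet: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/ui9OR7JARs4/Mp3oOtTqY0EJ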

Any suggestion would help!
Thank you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c2ead70e-c5d6-4001-87fd-645a16e670dc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Partial word match with singular and plurals: Elasticsearch

Radu Gheorghe-2
Hi Kruti,

The short answer is yes, it is possible. Here's one way to do it:

Define the fields you search on as multi fields, indexing each one several ways: for example, once not analyzed for exact matches, once with ngrams to account for typos, and so on. You can then query all of those sub-fields and wrap the queries in a multi_match query of type best_fields or in a dis_max query, which takes the best score (or the best score plus a fraction of the other scores, if you set the tie breaker).
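
The multi-field plus dis_max idea can be sketched roughly as follows. This is only an illustration in Python that builds a request body; the sub-field names (name.untouched, name.words, name.stemmed, name.ngrams), the boost values, the slop, and the tie_breaker are assumptions picked for the example, not tested settings.

```python
import json


def build_query(text, field="name"):
    """Build a dis_max request body over hypothetical sub-fields of `field`.

    Each clause targets one way of indexing the same text; the boosts
    (assumed values) rank exact > phrase with slop > stemmed > ngrams.
    """
    return {
        "query": {
            "dis_max": {
                # Add a fraction of the non-best clause scores to the best one.
                "tie_breaker": 0.1,
                "queries": [
                    # Exact match against the not_analyzed copy of the field.
                    {"term": {field + ".untouched": {"value": text, "boost": 10}}},
                    # Same words, allowed to be up to `slop` positions apart.
                    {"match_phrase": {field + ".words": {"query": text, "slop": 5, "boost": 5}}},
                    # Stemmed copy, so singular and plural forms match each other.
                    {"match": {field + ".stemmed": {"query": text, "boost": 2}}},
                    # ngram copy for substring matches, lowest priority.
                    {"match": {field + ".ngrams": {"query": text, "boost": 1}}},
                ],
            }
        }
    }


body = build_query("men's foil shaver")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a _search request; the relative boosts would need tuning against real data.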

Now, for the specific requirements you have:
1. For exact matching, you can skip analysis altogether by setting "index" to "not_analyzed". Alternatively, you could use the simple analyzer or something equally "harmless" to allow for some error. You could boost this kind of query a lot, so that exact matches come out on top.
2. For phrase matches with distance, you can use the match_phrase type of the match query. You can configure a slop that defines the maximum allowed distance for a match to show up in your results. Documents with "closer" words should get higher scores. You would boost this query less than the exact matches, but more than the following ones.
3. For handling plurals, you'd probably need to do some stemming. Have a look at the snowball token filter or the stemmer token filter. Again, this would be boosted lower than 1) and 2), but higher than 4).
4. For handling substrings, you can use ngrams, as you already seem to be doing. Alternatively, you can pay the price at query time by using the "fuzziness" option of the match query.
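
To make "closer words get higher scores" from point 2 concrete, here is a small Python sketch of the proximity part of the score. It assumes Lucene's classic TF-IDF similarity, where, as far as I know, a sloppy phrase match at distance d is credited a frequency of 1 / (d + 1) and the term-frequency factor is the square root of that frequency. The function names are made up for this example.

```python
import math


def sloppy_freq(distance):
    """Frequency credited to one sloppy phrase match at the given distance."""
    return 1.0 / (distance + 1)


def tf(freq):
    """Term-frequency factor of the classic TF-IDF similarity: sqrt(freq)."""
    return math.sqrt(freq)


# The proximity factor shrinks as the matched words move further apart.
for distance in range(0, 4):
    print(distance, round(tf(sloppy_freq(distance)), 3))
```

This is only one factor: IDF and field-length normalization are multiplied in as well, so the final ordering does not depend on distance alone.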

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, May 1, 2014 at 10:48 AM, Kruti Shukla <[hidden email]> wrote:
[quoted text trimmed]

Re: Partial word match with singular and plurals: Elasticsearch

Kruti_Shukla
Hi Radu,

Thank you so much for the suggestions. I knew about multi field but did not realize how helpful it could be; now I am able to experiment with the feature.
I followed your suggestions and created the index and mapping accordingly.

I tried queries for the first two requirements. The first one was simple; the second one uses slop. It is not returning results in the correct slop order (i.e., incremental distance).
Please help or suggest query improvements.

Please see my settings below:

For index: 
curl -XPUT "http://localhost:9200/my_improved_index" -d'
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 1,
                    "max_gram": 50
                },
                "my_stemmer": {
                    "type": "stemmer",
                    "name": "minimal_english"
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "lowercase",
                        "trigrams_filter"
                    ]
                },
                "my_stemmer_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "lowercase",
                        "my_stemmer"
                    ]
                }
            }
        }
    }
}'
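
As a rough illustration (plain Python, not Elasticsearch code), the sketch below mimics what an ngram token filter with min_gram 1 and max_gram 50 emits for a single token: every substring of the token. The function name ngrams is made up for this example. With these settings the filter is not limited to trigrams, despite its name.

```python
def ngrams(token, min_gram=1, max_gram=50):
    """Return every substring of `token` with length in [min_gram, max_gram]."""
    grams = []
    # Sizes larger than the token itself cannot produce any gram.
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


all_grams = ngrams("shaver")                         # settings from the index above
trigrams_only = ngrams("shaver", min_gram=3, max_gram=3)

print(len(all_grams), all_grams)
print(len(trigrams_only), trigrams_only)
```

For a six-letter token the first call yields 21 grams, including single letters, while true trigrams would yield only 4.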

For mappings:
curl -XPUT "http://localhost:9200/my_improved_index/my_improved_index_type/_mapping" -d'
{
    "my_improved_index_type": {
        "properties": {
            "name": {
                "type": "multi_field",
                "fields": {
                    "name_gram": {
                        "type": "string",
                        "analyzer": "trigrams"
                    },
                    "untouched": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "name_stemmer": {
                        "type": "string",
                        "analyzer": "my_stemmer_analyzer"
                    }
                }
            }
        }
    }
}'

Available documents:
1. men’s shaver
2. men’s shavers
3. men’s foil shaver
4. men’s foils shaver
5. men’s foil shavers
6. men’s foils shavers
7. men's foil advanced shaver
8. norelco men's foil advanced shaver

Query:
curl -XPOST "http://localhost:9200/my_improved_index/my_improved_index_type/_search" -d'
{
   "size": 30,
   "query": {
      "bool": {
         "should": [
            {
               "match": {
                  "name.untouched": {
                     "query": "men\"s shaver",
                     "operator": "and",
                     "type": "phrase",
                     "boost": "10"
                  }
               }
            },
            {
               "match_phrase": {
                  "name.name_stemmer": {
                     "query": "men\"s shaver",
                     "slop": 5
                  }
               }
            }
         ]
      }
   }
}'

Returned result:
1. men's shaver --> correct
2. men's shavers --> correct
3. men's foils shaver --> NOT correct
4. norelco men's foil advanced shaver --> NOT correct
5. men's foil advanced shaver --> NOT correct
6. men's foil shaver --> NOT correct. 

Expected result:
1. men's shaver --> exact phrase match
2. men's shavers --> ZERO word distance + 1 plural
3. men's foil shaver --> 1 word distance
4. men's foils shaver --> 1 word distance + 1 plural
5. men's foil advanced shaver --> 2 word distance
6. norelco men's foil advanced shaver --> 2 word distance

Why did the documents with a higher word distance score higher?
Is there a problem with the stemmer or nGram settings?


On Thursday, May 1, 2014 7:26:02 AM UTC-4, Radu Gheorghe wrote:
[quoted text trimmed]

Re: Partial word match with singular and plurals: Elasticsearch

Kruti_Shukla
Any help?
Why did the documents with a higher word distance score higher?
Is there a problem with the stemmer or nGram settings?


On Thursday, May 1, 2014 8:37:09 AM UTC-4, Kruti Shukla wrote:
Hi Radu,

Thank you so for the suggestions. I was knowing mul-field but was not knowing how helpful it can be but now I'm able play with the multi field feature.
I tried following suggestion and created index and mapping accordingly.

I tried querying for first 2. First one was simple and second one with slop. It is not returning correct slop(i,e, incremental distance). 
Please help/suggest query improvements.

Please see my settings below:

For index: 
curl -XPUT "<a href="http://localhost:9200/my_improved_index" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index\46sa\75D\46sntz\0751\46usg\75AFQjCNFfh7MNZpHn2rsYCC6BWqtVjGfPlA';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index\46sa\75D\46sntz\0751\46usg\75AFQjCNFfh7MNZpHn2rsYCC6BWqtVjGfPlA';return true;">http://localhost:9200/my_improved_index" -d'
{
   "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 1,
                    "max_gram": 50
                },
                 "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "minimal_english"
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "trigrams_filter"
                    ]
                },
                "my_stemmer_analyzer":{
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "my_stemmer"
                    ]
                }
            }
        }
    }
}'

For mappings:
curl -XPUT "<a href="http://localhost:9200/my_improved_index/my_improved_index_type/_mapping" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_mapping\46sa\75D\46sntz\0751\46usg\75AFQjCNEaEOWv3Ar-E3wU0jIRPkVXxBGZQw';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_mapping\46sa\75D\46sntz\0751\46usg\75AFQjCNEaEOWv3Ar-E3wU0jIRPkVXxBGZQw';return true;">http://localhost:9200/my_improved_index/my_improved_index_type/_mapping" -d'
{
    "my_improved_index_type": {
      "properties": {
         "name": {
            "type": "multi_field",
            "fields": {
               "name_gram": {
                  "type": "string",
                  "analyzer": "trigrams"
               },
               "untouched": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "name_stemmer":{
                   "type": "string",
                   "analyzer": "my_stemmer_analyzer"
               }
            }
         }
      }
   }
   
}'

Available documents:
1. men’s shaver
2. men’s shavers
3.     men’s foil shaver
4. men’s foils shaver
5. men’s foil shavers
6. men’s foils shavers
7.    men's foil advanced shaver
8.    norelco men's foil advanced shaver

Query:
curl -XPOST "<a href="http://localhost:9200/my_improved_index/my_improved_index_type/_search" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_search\46sa\75D\46sntz\0751\46usg\75AFQjCNGIbbzdLEpZ_1XJIwaNKnt5HKGf8w';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_search\46sa\75D\46sntz\0751\46usg\75AFQjCNGIbbzdLEpZ_1XJIwaNKnt5HKGf8w';return true;">http://localhost:9200/my_improved_index/my_improved_index_type/_search" -d'
{
   "size": 30,
   "query": {
      "bool": {
         "should": [
            {
               "match": {
                  "name.untouched": {
                     "query": "men\"s shaver",
                     "operator": "and",
                     "type": "phrase",
                     "boost": "10"
                  }
               }
            },
            {
               "match_phrase": {
                  "name.name_stemmer": {
                     "query": "men\"s shaver",
                     "slop": 5
                  }
               }
            }
         ]
      }
   }
}'

Returned result:
1. men's shaver --> correct
2. men's shavers --> correct
3. men's foils shaver --> NOT correct
4. norelco men's foil advanced shaver --> NOT correct
5. men's foil advanced shaver --> NOT correct
6. men's foil shaver --> NOT correct. 

Expected result:
1. men's shaver --> exact phrase match
2. men's shavers --> ZERO word distance + 1 plural
3. men's foil shaver --> 1 word distance
4. men's foils shaver --> 1 word distance + 1 plural
5. men's foil advanced shaver --> 2 word distance
4. norelco men's foil advanced shaver --> 2 word distance

Why higher distance document scored higher?
Is there any problem with stemmer or nGram settings?


On Thursday, May 1, 2014 7:26:02 AM UTC-4, Radu Gheorghe wrote:
Hi Kruti,

The short answer is yes, it is possible. Here's one way to do it:

Have the fields you search on as <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2F_multi_fields.html\46sa\75D\46sntz\0751\46usg\75AFQjCNEWxo_yTH65McDL-CXl-qZrC6lN4w';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2F_multi_fields.html\46sa\75D\46sntz\0751\46usg\75AFQjCNEWxo_yTH65McDL-CXl-qZrC6lN4w';return true;">multi field, where you index them with various settings, like once not-analyzed for exact matches, once with ngrams to account for typoes and so on. You can query all those sub-fields, and use the <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#type-best-fields" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fquery-dsl-multi-match-query.html%23type-best-fields\46sa\75D\46sntz\0751\46usg\75AFQjCNEFIAJwN5gQOfVWVZ1BWpJbFrFrKQ';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fquery-dsl-multi-match-query.html%23type-best-fields\46sa\75D\46sntz\0751\46usg\75AFQjCNEFIAJwN5gQOfVWVZ1BWpJbFrFrKQ';return true;">multi-match query with best fields or the <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fquery-dsl-dis-max-query.html\46sa\75D\46sntz\0751\46usg\75AFQjCNF5XldcGi6rp_Pob4TFfYNy0ha8jg';return true;" 
onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fquery-dsl-dis-max-query.html\46sa\75D\46sntz\0751\46usg\75AFQjCNF5XldcGi6rp_Pob4TFfYNy0ha8jg';return true;">DisMax query to wrap all those queries and take the best score (or the best score and a factor of the other scores by using the tie breaker).

Now, for the specific requirements you have:
1. For exact matching, you can skip analysis altogether and set "index" to "not_analyzed". Alternatively, you could use the simple analyzer (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html#analysis-simple-analyzer) or something equally "harmless" to allow for some error. You could boost this kind of query a lot, so that exact matches come out on top.
2. For phrase matches with distance, you can use the match_phrase type of the match query (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#_phrase). You can configure a slop that defines the maximum allowed distance for a match to show up in your results. Documents with "closer" words should get higher scores. You would boost this query less than the exact matches, but more than the following.
3. For handling plurals, you'd probably need to do some stemming. Have a look at the snowball token filter (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html) or the stemmer token filter (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html#analysis-stemmer-tokenfilter). Again, this would be boosted lower than 1) and 2), but higher than 4).
4. For handling substrings, you can use ngrams, as you already seem to be doing. Alternatively, you can pay the price at query time by using the "fuzziness" option of the match query.
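
As a rough sketch of how the four layers above could be combined, the request body might be assembled like this. The sub-field names (name.untouched, name.standard, name.stemmed, name.ngram), the boost values, the slop and the tie_breaker are all assumptions chosen for illustration, not recommended settings. The snippet only builds and prints the JSON; it does not contact a cluster.

```python
import json


def layered_query(text, slop=2):
    """Build a dis_max query with one clause per matching strategy.

    Field names and boosts are illustrative assumptions.
    """
    clauses = [
        # 1. exact match on a not_analyzed sub-field: highest boost
        {"term": {"name.untouched": {"value": text, "boost": 10}}},
        # 2. phrase match with slop on an unstemmed sub-field
        {"match_phrase": {"name.standard": {"query": text, "slop": slop, "boost": 5}}},
        # 3. stemmed sub-field, so singular and plural forms meet
        {"match": {"name.stemmed": {"query": text, "boost": 2}}},
        # 4. ngram sub-field for substring matches: lowest boost
        {"match": {"name.ngram": {"query": text, "boost": 1}}},
    ]
    return {"query": {"dis_max": {"queries": clauses, "tie_breaker": 0.3}}}


body = layered_query("men's foil shaver")
print(json.dumps(body, indent=2))
```

The printed body could then be sent to the _search endpoint with curl. The relative order of the boosts matters more than the exact numbers, which would need tuning against real data.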

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




Re: Partial word match with singular and plurals: Elasticsearch

Radu Gheorghe-2
Hello,

The exact match vs. plural ordering is probably because of the stemmer. As you have your fields and queries now, Elasticsearch has no way to boost individual exact word matches higher. To fix this, you can add another field where you analyze the text with the standard analyzer (no stemming). Then add a query on that field to your bool, and exact word matches should be ranked higher. I would use a simple match for that (no phrase), to account for the case where one word is exact and one is plural: such a document should be ranked higher than one where both are plurals. You'll get that with a standard match because it looks for all terms, while match_phrase will try to match the phrase with the given slop and neither of those two documents will get hit.

I don't know why the higher-distance document is scored higher in your case; the 6th result should have been higher. Can you try with an index of one shard and see if the results are any different?

Either way, you should get an explanation for each document's score by enabling Explain:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-explain.html
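
Explain is switched on by adding "explain": true to the search request body; each hit then carries a nested _explanation tree. Those trees get deep quickly, so a small helper that flattens one into indented lines can make them easier to read. This is only a sketch: the helper name is made up, and the sample tree is hand-written with invented values in the shape Elasticsearch returns.

```python
def flatten_explanation(node, depth=0):
    """Return (depth, value, description) tuples for an _explanation tree."""
    rows = [(depth, node["value"], node["description"])]
    for child in node.get("details", []):
        rows.extend(flatten_explanation(child, depth + 1))
    return rows


# Hand-written sample; the values are made up for illustration.
sample = {
    "value": 0.5,
    "description": "product of:",
    "details": [
        {
            "value": 1.5,
            "description": "sum of:",
            "details": [
                {"value": 1.5, "description": "weight(name:foil in 0)"},
            ],
        },
        {"value": 0.33333334, "description": "coord(1/3)"},
    ],
}

for depth, value, description in flatten_explanation(sample):
    print("  " * depth + f"{value:<12} {description}")
```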

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, May 2, 2014 at 1:40 PM, Kruti Shukla <[hidden email]> wrote:
Any help?
Why higher distance document scored higher?
Is there any problem with stemmer or nGram settings?


On Thursday, May 1, 2014 8:37:09 AM UTC-4, Kruti Shukla wrote:
Hi Radu,

Thank you so much for the suggestions. I knew about multi-field but did not know how helpful it could be; now I'm able to play with the multi field feature.
I tried following your suggestion and created the index and mapping accordingly.

I tried querying for the first 2. The first one was simple and the second one with slop. It is not returning results in the correct slop order (i.e., incremental distance).
Please help/suggest query improvements.

Please see my settings below:

For index:
curl -XPUT "http://localhost:9200/my_improved_index" -d'
{
   "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 1,
                    "max_gram": 50
                },
                 "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "minimal_english"
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "trigrams_filter"
                    ]
                },
                "my_stemmer_analyzer":{
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "my_stemmer"
                    ]
                }
            }
        }
    }
}'
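
A side note on the settings above: the filter is called trigrams_filter, but with min_gram 1 and max_gram 50 it would emit every substring of a token rather than only 3-character grams. A small pure-Python approximation (not the actual Lucene filter, and the output order is not meant to match it) gives a feel for how many tokens that produces:

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of token whose length is between min_gram and max_gram."""
    out = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            out.append(token[start:start + size])
    return out


print(len(ngrams("shaver", 1, 50)))  # 21
print(ngrams("shaver", 3, 3))        # ['sha', 'hav', 'ave', 'ver']
```

With min_gram 1, single letters are indexed as terms too, so almost any query would share some gram with almost any document on that field.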

For mappings:
curl -XPUT "http://localhost:9200/my_improved_index/my_improved_index_type/_mapping" -d'
{
    "my_improved_index_type": {
      "properties": {
         "name": {
            "type": "multi_field",
            "fields": {
               "name_gram": {
                  "type": "string",
                  "analyzer": "trigrams"
               },
               "untouched": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "name_stemmer":{
                   "type": "string",
                   "analyzer": "my_stemmer_analyzer"
               }
            }
         }
      }
   }
   
}'

Available documents:
1. men’s shaver
2. men’s shavers
3.     men’s foil shaver
4. men’s foils shaver
5. men’s foil shavers
6. men’s foils shavers
7.    men's foil advanced shaver
8.    norelco men's foil advanced shaver

Query:
curl -XPOST "http://localhost:9200/my_improved_index/my_improved_index_type/_search" -d'
{
   "size": 30,
   "query": {
      "bool": {
         "should": [
            {
               "match": {
                  "name.untouched": {
                     "query": "men\"s shaver",
                     "operator": "and",
                     "type": "phrase",
                     "boost": "10"
                  }
               }
            },
            {
               "match_phrase": {
                  "name.name_stemmer": {
                     "query": "men\"s shaver",
                     "slop": 5
                  }
               }
            }
         ]
      }
   }
}'

Returned result:
1. men's shaver --> correct
2. men's shavers --> correct
3. men's foils shaver --> NOT correct
4. norelco men's foil advanced shaver --> NOT correct
5. men's foil advanced shaver --> NOT correct
6. men's foil shaver --> NOT correct. 

Expected result:
1. men's shaver --> exact phrase match
2. men's shavers --> ZERO word distance + 1 plural
3. men's foil shaver --> 1 word distance
4. men's foils shaver --> 1 word distance + 1 plural
5. men's foil advanced shaver --> 2 word distance
6. norelco men's foil advanced shaver --> 2 word distance

Why higher distance document scored higher?
Is there any problem with stemmer or nGram settings?
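
To make the "word distance" in the expected results concrete, here is a small sketch that counts the extra words sitting between the query terms in each document, which is what a phrase query would need as slop for in-order terms. The norm() helper is a toy stand-in for the stemmer and purely hypothetical. The 1/(distance + 1) weighting is how Lucene's default similarity treats sloppy phrase matches as far as I understand it, so treat that as an assumption; field length also enters the real score.

```python
def norm(token):
    """Toy normalisation: lowercase, unify apostrophes, drop possessive/plural s."""
    t = token.lower().replace("\u2019", "'")
    if t.endswith("'s"):
        return t[:-2]
    if t.endswith("s"):
        return t[:-1]
    return t


def word_distance(query, doc):
    """Number of extra words between the first and last query term in doc.

    Assumes every query term occurs in doc, in query order.
    """
    q = [norm(t) for t in query.split()]
    d = [norm(t) for t in doc.split()]
    first = d.index(q[0])
    last = d.index(q[-1], first)
    return (last - first) - (len(q) - 1)


docs = [
    "men's shaver",
    "men's shavers",
    "men's foil shaver",
    "men's foils shaver",
    "men's foil advanced shaver",
    "norelco men's foil advanced shaver",
]
for doc in docs:
    gap = word_distance("men's shaver", doc)
    print(f"{doc!r}: distance {gap}, sloppy frequency {1 / (gap + 1):.2f}")
```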


On Thursday, May 1, 2014 7:26:02 AM UTC-4, Radu Gheorghe wrote:
Hi Kruti,

The short answer is yes, it is possible. Here's one way to do it:

Have the fields you search on as multi fields, where you index them with various settings, like once not-analyzed for exact matches, once with ngrams to account for typos, and so on. You can query all those sub-fields and use the multi-match query with best fields or the DisMax query to wrap all those queries and take the best score (or the best score plus a fraction of the other scores by using the tie breaker).



Re: Partial word match with singular and plurals: Elasticsearch

Kruti_Shukla
Hi Radu,
Thank you so much for your reply and suggestions. They are really helping me solve my query as well as build my knowledge of Elasticsearch.

I now have the index on only 1 shard. Results are somewhat improved.
I added one more field with the "standard" analyzer.

PUT /my_improved_index/my_improved_index_type/_mapping
{
    "my_improved_index_type": {
      "properties": {
         "name": {
            "type": "multi_field",
            "fields": {
               "name_gram": {
                  "type": "string",
                  "index_analyzer": "trigrams"
               },
               "untouched": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "name_stemmer":{
                   "type": "string",
                   "analyzer": "my_stemmer_analyzer"
               },
               "name_standard":{
                   "type": "string",
                   "analyzer": "standard"
               }
            }
         }
      }
   }
   
}

There are still problems with the returned results.
Query:

curl -XPOST "http://localhost:9200/my_improved_index/my_improved_index_type/_search" -d'
{
   "size": 30,
   "query": {
      "bool": {
         "should": [
            {
               "match": {
                  "name.untouched": {
                     "query": "men\"s foil shaver",
                     "operator": "and",
                     "type": "phrase",
                     "boost": "10"
                  }
               }
            },
            {
               "match_phrase": {
                  "name.name_stemmer": {
                     "query": "men\"s foil shaver",
                     "slop": 5
                  }
               }
            },
            {
               "match": {
                  "name.name_standard": {
                     "query": "men\"s foil shaver"
                  }
               }
            }
         ]
      }
   }
}'

Returned result:
1. men's foil shaver --> score:  4.4437184
2. men's foils shaver --> score: 0.5215846
3. men's foil advanced shaver --> score: 0.49008065  --> should be 4th
4. norelco men's foil advanced shaver --> score: 0.42882058  --> should be 5th
5. men's shaver --> score: 0.04429976 --> should be 6th
6. men’s foil shavers --> score: 0.010844119 --> should be 3rd
7. men's shavers --> score: 0.010372223 

Please suggest. I tried setting explain to true, but it did not help much.

Below is the explanation for the 6th returned result, "men's foil shavers":

{
            "_shard": 0,
            "_node": "VRNH3VrlTC2Tu6y_GgDZbw",
            "_index": "my_improved_index",
            "_type": "my_improved_index_type",
            "_id": "35",
            "_score": 0.010844119,
            "_source": {
               "name": "men’s foil shavers"
            },
            "_explanation": {
               "value": 0.010844119,
               "description": "product of:",
               "details": [
                  {
                     "value": 0.032532357,
                     "description": "sum of:",
                     "details": [
                        {
                           "value": 0.032532357,
                           "description": "product of:",
                           "details": [
                              {
                                 "value": 0.09759706,
                                 "description": "sum of:",
                                 "details": [
                                    {
                                       "value": 0.09759706,
                                       "description": "weight(name.name_standard:foil in 26) [PerFieldSimilarity], result of:",
                                       "details": [
                                          {
                                             "value": 0.09759706,
                                             "description": "score(doc=26,freq=1.0 = termFreq=1.0\n), product of:",
                                             "details": [
                                                {
                                                   "value": 0.07266014,
                                                   "description": "queryWeight, product of:",
                                                   "details": [
                                                      {
                                                         "value": 2.686399,
                                                         "description": "idf(docFreq=4, maxDocs=27)"
                                                      },
                                                      {
                                                         "value": 0.027047412,
                                                         "description": "queryNorm"
                                                      }
                                                   ]
                                                },
                                                {
                                                   "value": 1.3431995,
                                                   "description": "fieldWeight in 26, product of:",
                                                   "details": [
                                                      {
                                                         "value": 1,
                                                         "description": "tf(freq=1.0), with freq of:",
                                                         "details": [
                                                            {
                                                               "value": 1,
                                                               "description": "termFreq=1.0"
                                                            }
                                                         ]
                                                      },
                                                      {
                                                         "value": 2.686399,
                                                         "description": "idf(docFreq=4, maxDocs=27)"
                                                      },
                                                      {
                                                         "value": 0.5,
                                                         "description": "fieldNorm(doc=26)"
                                                      }
                                                   ]
                                                }
                                             ]
                                          }
                                       ]
                                    }
                                 ]
                              },
                              {
                                 "value": 0.33333334,
                                 "description": "coord(1/3)"
                              }
                           ]
                        }
                     ]
                  },
                  {
                     "value": 0.33333334,
                     "description": "coord(1/3)"
                  }
               ]
            }
         }
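
The numbers in this explanation can be multiplied back together to see where the low score comes from. The sketch below copies the values from the output above and reproduces the final score; variable names are mine. Read this way, only the term "foil" matched on name.name_standard (inner coord(1/3)), and only one of the three should clauses matched at all (outer coord(1/3)).

```python
import math

idf = 2.686399            # idf(docFreq=4, maxDocs=27)
query_norm = 0.027047412  # queryNorm
tf = 1.0                  # tf(freq=1.0)
field_norm = 0.5          # fieldNorm(doc=26)
coord = 1 / 3             # coord(1/3), applied at two levels

query_weight = idf * query_norm          # ~0.07266014
field_weight = tf * idf * field_norm     # ~1.3431995
term_score = query_weight * field_weight # ~0.09759706
final_score = term_score * coord * coord # ~0.010844119

print(query_weight, field_weight, term_score, final_score)
```

One thing that might be worth checking, given that _source shows a typographic apostrophe (men’s): whether the indexed text and the query text produce the same token for that word. That is only a guess from the data shown here, not something the explanation proves.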

On Friday, May 2, 2014 8:30:03 AM UTC-4, Radu Gheorghe wrote:
Hello,

The exact match vs plural is probably because of the stemmer. As you have your fields and queries now, Elasticsearch has no way to boost individual exact word matches higher. To fix this, you can add another field where you just analyze the text using the standard analyzer (no stemming). Then add that to another query within your bool and exact word matches should be ranked higher. Though I would do a simple match for that (no phrase), to account for the case where one word is exact and one is plural -> such a document should be ranked higher than if both are plurals. You'll get that with standard match because it looks for all terms, while match_phrase will try to match the phrase with the given slop and none of those two documents will get hit.

I don't know why the higher distance document is scored higher in your case - the 6th result should have been higher. Can you try with an index of one shard and see if results are any different?

Either way, you should get an explanation for each document's score by enabling Explain:
<a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-explain.html" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fsearch-request-explain.html\46sa\75D\46sntz\0751\46usg\75AFQjCNGdZ6VyI5xFcaBvfOHEJK9wfKFlIQ';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Fwww.elasticsearch.org%2Fguide%2Fen%2Felasticsearch%2Freference%2Fcurrent%2Fsearch-request-explain.html\46sa\75D\46sntz\0751\46usg\75AFQjCNGdZ6VyI5xFcaBvfOHEJK9wfKFlIQ';return true;">http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-explain.html

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * <a href="http://sematext.com/" style="font-size:13px;font-family:arial,sans-serif" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Fsematext.com%2F\46sa\75D\46sntz\0751\46usg\75AFQjCNFOz7jzL4dgjz1lPl99mo_THPxEYg';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Fsematext.com%2F\46sa\75D\46sntz\0751\46usg\75AFQjCNFOz7jzL4dgjz1lPl99mo_THPxEYg';return true;">http://sematext.com/


On Fri, May 2, 2014 at 1:40 PM, Kruti Shukla <<a href="javascript:" target="_blank" gdf-obfuscated-mailto="9DP0arfscLQJ" onmousedown="this.href='javascript:';return true;" onclick="this.href='javascript:';return true;">krutib...@...> wrote:
Any help?
Why higher distance document scored higher?
Is there any problem with stemmer or nGram settings?


On Thursday, May 1, 2014 8:37:09 AM UTC-4, Kruti Shukla wrote:
Hi Radu,

Thank you so for the suggestions. I was knowing mul-field but was not knowing how helpful it can be but now I'm able play with the multi field feature.
I tried following suggestion and created index and mapping accordingly.

I tried querying for first 2. First one was simple and second one with slop. It is not returning correct slop(i,e, incremental distance). 
Please help/suggest query improvements.

Please see my settings below:

For index: 
curl -XPUT "<a href="http://localhost:9200/my_improved_index" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index\46sa\75D\46sntz\0751\46usg\75AFQjCNFfh7MNZpHn2rsYCC6BWqtVjGfPlA';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index\46sa\75D\46sntz\0751\46usg\75AFQjCNFfh7MNZpHn2rsYCC6BWqtVjGfPlA';return true;">http://localhost:9200/my_improved_index" -d'
{
   "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 1,
                    "max_gram": 50
                },
                 "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "minimal_english"
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "trigrams_filter"
                    ]
                },
                "my_stemmer_analyzer":{
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "my_stemmer"
                    ]
                }
            }
        }
    }
}'

For mappings:
curl -XPUT "<a href="http://localhost:9200/my_improved_index/my_improved_index_type/_mapping" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_mapping\46sa\75D\46sntz\0751\46usg\75AFQjCNEaEOWv3Ar-E3wU0jIRPkVXxBGZQw';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_mapping\46sa\75D\46sntz\0751\46usg\75AFQjCNEaEOWv3Ar-E3wU0jIRPkVXxBGZQw';return true;">http://localhost:9200/my_improved_index/my_improved_index_type/_mapping" -d'
{
    "my_improved_index_type": {
      "properties": {
         "name": {
            "type": "multi_field",
            "fields": {
               "name_gram": {
                  "type": "string",
                  "analyzer": "trigrams"
               },
               "untouched": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "name_stemmer":{
                   "type": "string",
                   "analyzer": "my_stemmer_analyzer"
               }
            }
         }
      }
   }
   
}'

Available documents:
1. men’s shaver
2. men’s shavers
3.     men’s foil shaver
4. men’s foils shaver
5. men’s foil shavers
6. men’s foils shavers
7.    men's foil advanced shaver
8.    norelco men's foil advanced shaver

Query:
curl -XPOST "<a href="http://localhost:9200/my_improved_index/my_improved_index_type/_search" target="_blank" onmousedown="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_search\46sa\75D\46sntz\0751\46usg\75AFQjCNGIbbzdLEpZ_1XJIwaNKnt5HKGf8w';return true;" onclick="this.href='http://www.google.com/url?q\75http%3A%2F%2Flocalhost%3A9200%2Fmy_improved_index%2Fmy_improved_index_type%2F_search\46sa\75D\46sntz\0751\46usg\75AFQjCNGIbbzdLEpZ_1XJIwaNKnt5HKGf8w';return true;">http://localhost:9200/my_improved_index/my_improved_index_type/_search" -d'
{
   "size": 30,
   "query": {
      "bool": {
         "should": [
            {
               "match": {
                  "name.untouched": {
                     "query": "men\"s shaver",
                     "operator": "and",
                     "type": "phrase",
                     "boost": "10"
                  }
               }
            },
            {
               "match_phrase": {
                  "name.name_stemmer": {
                     "query": "men\"s shaver",
                     "slop": 5
                  }
               }
            }
         ]
      }
   }
}'

Returned result:
1. men's shaver --> correct
2. men's shavers --> correct
3. men's foils shaver --> NOT correct
4. norelco men's foil advanced shaver --> NOT correct
5. men's foil advanced shaver --> NOT correct
6. men's foil shaver --> NOT correct. 

Expected result:
1. men's shaver --> exact phrase match
2. men's shavers --> ZERO word distance + 1 plural
3. men's foil shaver --> 1 word distance
4. men's foils shaver --> 1 word distance + 1 plural
5. men's foil advanced shaver --> 2 word distance
6. norelco men's foil advanced shaver --> 2 word distance

Why did the documents with a higher word distance score higher?
Is there a problem with my stemmer or nGram settings?


On Thursday, May 1, 2014 7:26:02 AM UTC-4, Radu Gheorghe wrote:
Hi Kruti,

The short answer is yes, it is possible. Here's one way to do it:

Have the fields you search on as multi fields (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html), where you index them with various settings: once not analyzed for exact matches, once with ngrams to account for typos, and so on. You can query all those sub-fields, and use the multi-match query with best fields (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#type-best-fields) or the DisMax query (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html) to wrap all those queries and take the best score (or the best score and a factor of the other scores by using the tie breaker).

Now, for the specific requirements you have:
1. For exact matching, you can skip analysis altogether, and set "index" to "not_analyzed". Alternatively, you could use the simple analyzer (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html#analysis-simple-analyzer) or something equally "harmless" to allow for some error. You could boost this kind of query a lot, so that exact matches come out on top.
2. For phrase matches with distance, you can use the match_phrase type of the match query (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#_phrase). You can configure a slop that defines the maximum allowed distance for a match to show up in your results. Documents with "closer" words should get higher scores. You would boost this query less than the exact matches, but more than the following.
3. For handling plurals, you'd probably need to do some stemming. Have a look at the snowball token filter (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html) or the stemmer token filter (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html#analysis-stemmer-tokenfilter). Again, this would be boosted lower than 1) and 2), but more than 4).
4. For handling substrings, you can use ngrams, as you already seem to be doing. Alternatively, you can pay the price at query time by using the "fuzziness" option of the match query.
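
The four points above could be combined roughly as follows. This is only a sketch in Python that builds the request body; it assumes the sub-field names used elsewhere in this thread (name.untouched, name.name_stemmer, name.name_gram), and the boost and tie_breaker values are made-up placeholders that would need tuning. Points 2 and 3 share the stemmed sub-field here because that is the only analyzed, non-ngram sub-field in the mapping.

```python
import json


def build_tiered_query(text, tie_breaker=0.1):
    """Build a dis_max search body that queries each sub-field with its own boost.

    The boosts (10, 3, 1) and the tie_breaker are illustrative assumptions only.
    """
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": [
                    # 1. exact match on the not_analyzed sub-field, boosted the most
                    {"term": {"name.untouched": {"value": text, "boost": 10}}},
                    # 2 and 3. phrase match with slop on the stemmed sub-field
                    {"match_phrase": {"name.name_stemmer": {"query": text, "slop": 5, "boost": 3}}},
                    # 4. substring match on the ngram sub-field, boosted the least
                    {"match": {"name.name_gram": {"query": text, "boost": 1}}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_tiered_query("men's foil shaver"), indent=2))
```

The printed JSON can be sent as the body of a _search request. With dis_max, the best-scoring sub-query dominates, and the tie_breaker lets the other matching sub-queries contribute a fraction of their scores.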

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, May 1, 2014 at 10:48 AM, Kruti Shukla <[hidden email]> wrote:
My final goal is to have following search precedence:
1. Exact phrase match
2. Exact word match with incremental distance
3. Plurals
4. Substring

Suppose I have following documents:
i. men’s shaver
ii. men’s shavers
iii.     men’s foil shaver
iv. men’s foils shaver
v. men’s foil shavers
vi. men’s foils shavers

Case 1: search for : “men’s foil shaver”
Expected result:
1. men’s foil shaver <------ exact phrase match
2. men’s foil shavers <------ exact word match on 2 of 3 words with 0 word distance + plural
3. men’s foils shaver <------ exact word match on 2 of 3 words with 1 word distance + plural
4. men’s foils shavers <------ exact word match on 1 of 3 words + 2 plurals
5. men’s shaver <------ exact word match on 2 of 3 words (66% match)
6. men’s shavers <------ exact word match on 1 of 3 words + plural (66% match)

Case 2: search for : “men’s foil shavers”
Expected result:
1. men’s foil shavers <------ exact phrase match
2. men’s foil shaver <------ exact word match on 2 of 3 words with 0 word distance + singular
3. men’s foils shavers <------ exact word match on 2 of 3 words with 1 word distance + singular
4. men’s foils shaver <------ exact word match on 1 of 3 words + 2 singulars
5. men’s shavers <------ exact word match on 2 of 3 words (66% match)
6. men’s shaver <------ exact word match on 1 of 3 words + singular (66% match)


Case 3: search for : “men’s foils shavers”
Expected result:
1. men’s foils shavers <------ exact phrase match
2. men’s foils shaver <------ exact word match on 2 of 3 words with 0 word distance + singular
3. men’s foil shavers <------ exact word match on 2 of 3 words with 1 word distance + singular
4. men’s foil shaver <------ exact word match on 1 of 3 words + 2 singulars
5. men’s shavers <------ exact word match on 2 of 3 words (66% match)
6. men’s shaver <------ exact word match on 1 of 3 words + singular (66% match)


Is there any way in elasticsearch I can achieve this?
This question is related to my other question which is not answered yet.
Link to my other question: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/ui9OR7JARs4/Mp3oOtTqY0EJ

Any suggestion would help!
Thank you.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c2ead70e-c5d6-4001-87fd-645a16e670dc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Partial word match with singular and plurals: Elasticsearch

Kruti_Shukla
I tried changing the tokenizer from "standard" to "whitespace". In the mapping I also split the analyzers, setting "index_analyzer" to my custom analyzer and "search_analyzer" to the default standard analyzer. The results still have not improved.
The explain output is not much help either.


On Fri, May 2, 2014 at 9:10 AM, Kruti Shukla <[hidden email]> wrote:
Hi Radu,
Thank you so much for your reply and suggestions. They are really helping me solve my query problem and improve my knowledge of Elasticsearch.

I now have the index on only 1 shard. The results are somewhat improved.
I added one more field with the "standard" analyzer.

PUT /my_improved_index/my_improved_index_type/_mapping
{
    "my_improved_index_type": {
      "properties": {
         "name": {
            "type": "multi_field",
            "fields": {
               "name_gram": {
                  "type": "string",
                  "index_analyzer": "trigrams"
               },
               "untouched": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "name_stemmer":{
                   "type": "string",
                   "analyzer": "my_stemmer_analyzer"
               },
               "name_standard":{
                   "type": "string",
                   "analyzer": "standard"
               }
            }
         }
      }
   }
   
}

There are still problems with the returned results.
Query:
{
   "size": 30,
   "query": {
      "bool": {
         "should": [
            {
               "match": {
                  "name.untouched": {
                     "query": "men\"s foil shaver",
                     "operator": "and",
                     "type": "phrase",
                     "boost": "10"
                  }
               }
            },
            {
               "match_phrase": {
                  "name.name_stemmer": {
                     "query": "men\"s foil shaver",
                     "slop": 5
                  }
               }
            },
            {
               "match": {
                  "name.name_standard": {
                     "query": "men\"s foil shaver"
                  }
               }
            }
         ]
      }
   }
}

Returned result:
1. men's foil shaver --> score:  4.4437184
2. men's foils shaver --> score: 0.5215846
3. men's foil advanced shaver --> score: 0.49008065  --> should be 4th
4. norelco men's foil advanced shaver --> score: 0.42882058  --> should be 5th
5. men's shaver --> score: 0.04429976 --> should be 6th
6. men’s foil shavers --> score: 0.010844119 --> should be 3rd
7. men's shavers --> score: 0.010372223 

Please suggest. I tried setting "explain": true, but it did not help much.

Below is the explanation for the 6th returned result, "men's foil shavers":

{
            "_shard": 0,
            "_node": "VRNH3VrlTC2Tu6y_GgDZbw",
            "_index": "my_improved_index",
            "_type": "my_improved_index_type",
            "_id": "35",
            "_score": 0.010844119,
            "_source": {
               "name": "men’s foil shavers"
            },
            "_explanation": {
               "value": 0.010844119,
               "description": "product of:",
               "details": [
                  {
                     "value": 0.032532357,
                     "description": "sum of:",
                     "details": [
                        {
                           "value": 0.032532357,
                           "description": "product of:",
                           "details": [
                              {
                                 "value": 0.09759706,
                                 "description": "sum of:",
                                 "details": [
                                    {
                                       "value": 0.09759706,
                                       "description": "weight(name.name_standard:foil in 26) [PerFieldSimilarity], result of:",
                                       "details": [
                                          {
                                             "value": 0.09759706,
                                             "description": "score(doc=26,freq=1.0 = termFreq=1.0\n), product of:",
                                             "details": [
                                                {
                                                   "value": 0.07266014,
                                                   "description": "queryWeight, product of:",
                                                   "details": [
                                                      {
                                                         "value": 2.686399,
                                                         "description": "idf(docFreq=4, maxDocs=27)"
                                                      },
                                                      {
                                                         "value": 0.027047412,
                                                         "description": "queryNorm"
                                                      }
                                                   ]
                                                },
                                                {
                                                   "value": 1.3431995,
                                                   "description": "fieldWeight in 26, product of:",
                                                   "details": [
                                                      {
                                                         "value": 1,
                                                         "description": "tf(freq=1.0), with freq of:",
                                                         "details": [
                                                            {
                                                               "value": 1,
                                                               "description": "termFreq=1.0"
                                                            }
                                                         ]
                                                      },
                                                      {
                                                         "value": 2.686399,
                                                         "description": "idf(docFreq=4, maxDocs=27)"
                                                      },
                                                      {
                                                         "value": 0.5,
                                                         "description": "fieldNorm(doc=26)"
                                                      }
                                                   ]
                                                }
                                             ]
                                          }
                                       ]
                                    }
                                 ]
                              },
                              {
                                 "value": 0.33333334,
                                 "description": "coord(1/3)"
                              }
                           ]
                        }
                     ]
                  },
                  {
                     "value": 0.33333334,
                     "description": "coord(1/3)"
                  }
               ]
            }
         }
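
Reading the explanation from the innermost entries outwards may make it easier to follow. The sketch below only re-multiplies the numbers printed above (the variable and function names are mine, not Elasticsearch's). It suggests that the whole score comes from the single term "foil" in name.name_standard, scaled down by two coord(1/3) factors. That would be consistent with only 1 of the 3 query terms matching in that field, and only 1 of the 3 "should" clauses matching at all. One thing that may be worth checking is the apostrophe: the _source above contains a typographic apostrophe (’) while the query uses a different character.

```python
# Values copied from the "_explanation" output above.
idf = 2.686399            # idf(docFreq=4, maxDocs=27)
query_norm = 0.027047412  # queryNorm
tf = 1.0                  # tf(freq=1.0)
field_norm = 0.5          # fieldNorm(doc=26)
coord_inner = 1 / 3       # coord(1/3) inside the name.name_standard clause
coord_outer = 1 / 3       # coord(1/3) of the top-level bool query


def explain_score():
    """Recompute the final score from the factors listed in the explanation."""
    query_weight = idf * query_norm        # about 0.07266014
    field_weight = tf * idf * field_norm   # 1.3431995
    term_score = query_weight * field_weight
    return term_score * coord_inner * coord_outer


if __name__ == "__main__":
    print(round(explain_score(), 6))
```

The result agrees with the reported _score of 0.010844119 up to rounding, so neither the name.untouched clause nor the name.name_stemmer clause contributed anything for this document.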

On Friday, May 2, 2014 8:30:03 AM UTC-4, Radu Gheorghe wrote:
Hello,

The exact match vs plural is probably because of the stemmer. As you have your fields and queries now, Elasticsearch has no way to boost individual exact word matches higher. To fix this, you can add another field where you just analyze the text using the standard analyzer (no stemming). Then add that to another query within your bool and exact word matches should be ranked higher. Though I would do a simple match for that (no phrase), to account for the case where one word is exact and one is plural -> such a document should be ranked higher than if both are plurals. You'll get that with standard match because it looks for all terms, while match_phrase will try to match the phrase with the given slop and none of those two documents will get hit.

I don't know why the higher distance document is scored higher in your case - the 6th result should have been higher. Can you try with an index of one shard and see if results are any different?

Either way, you should get an explanation for each document's score by enabling explain ("explain": true in the search request body).
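
A minimal sketch of that, in Python: one helper adds "explain": true to an existing search body, and another flattens the nested "_explanation" of a hit into its innermost entries, which are usually the quickest way to see which field and term actually matched. Both helper names are hypothetical, not part of any Elasticsearch client.

```python
import copy


def with_explain(search_body):
    """Return a copy of a search request body with explain switched on."""
    body = copy.deepcopy(search_body)
    body["explain"] = True
    return body


def leaf_descriptions(explanation):
    """Collect (value, description) pairs from the innermost explanation entries."""
    details = explanation.get("details") or []
    if not details:
        return [(explanation["value"], explanation["description"])]
    leaves = []
    for child in details:
        leaves.extend(leaf_descriptions(child))
    return leaves


if __name__ == "__main__":
    print(with_explain({"size": 30, "query": {"match_all": {}}}))
```

Each hit in the response then carries an "_explanation" object, and leaf_descriptions(hit["_explanation"]) lists the idf, tf, fieldNorm and coord entries without the nesting.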

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, May 2, 2014 at 1:40 PM, Kruti Shukla <[hidden email]> wrote:
Any help?
Why did the documents with a higher word distance score higher?
Is there a problem with my stemmer or nGram settings?


On Thursday, May 1, 2014 8:37:09 AM UTC-4, Kruti Shukla wrote:
Hi Radu,

Thank you so much for the suggestions. I knew about multi fields but did not know how helpful they could be; now I'm able to play with the multi field feature.
I followed your suggestions and created the index and mapping accordingly.

I tried querying for the first 2 requirements. The first one was simple and the second one uses slop. It is not returning results in the correct slop order (i.e. incremental distance).
Please help/suggest query improvements.

Please see my settings below:

For index: 
{
   "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 1,
                    "max_gram": 50
                },
                 "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "minimal_english"
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "trigrams_filter"
                    ]
                },
                "my_stemmer_analyzer":{
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "standard",
                        "lowercase",
                        "my_stemmer"
                    ]
                }
            }
        }
    }
}
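
A note on the "trigrams_filter" in these settings: with min_gram 1 and max_gram 50 it is not limited to trigrams. Below is a rough Python sketch of what an ngram token filter with those bounds emits for a single token. It only illustrates the idea and is not Lucene's implementation; the function name and the output order are my own choices.

```python
def ngrams(token, min_gram=1, max_gram=50):
    """Return every substring of token whose length is between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    print(ngrams("foil"))        # all substrings of length 1 to 4
    print(ngrams("foil", 3, 3))  # true trigrams only
```

With the settings above, every single letter of every word becomes an indexed term, so a query analyzed the same way can match very loosely. Setting both min_gram and max_gram to 3 would give actual trigrams.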

For mappings:
{
    "my_improved_index_type": {
      "properties": {
         "name": {
            "type": "multi_field",
            "fields": {
               "name_gram": {
                  "type": "string",
                  "analyzer": "trigrams"
               },
               "untouched": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "name_stemmer":{
                   "type": "string",
                   "analyzer": "my_stemmer_analyzer"
               }
            }
         }
      }
   }
   
}

Available documents:
1. men’s shaver
2. men’s shavers
3. men’s foil shaver
4. men’s foils shaver
5. men’s foil shavers
6. men’s foils shavers
7. men's foil advanced shaver
8. norelco men's foil advanced shaver

Query:
curl -XPOST "http://localhost:9200/my_improved_index/my_improved_index_type/_search" -d'
{
   "size": 30,
   "query": {
      "bool": {
         "should": [
            {
               "match": {
                  "name.untouched": {
                     "query": "men\"s shaver",
                     "operator": "and",
                     "type": "phrase",
                     "boost": "10"
                  }
               }
            },
            {
               "match_phrase": {
                  "name.name_stemmer": {
                     "query": "men\"s shaver",
                     "slop": 5
                  }
               }
            }
         ]
      }
   }
}'

Returned result:
1. men's shaver --> correct
2. men's shavers --> correct
3. men's foils shaver --> NOT correct
4. norelco men's foil advanced shaver --> NOT correct
5. men's foil advanced shaver --> NOT correct
6. men's foil shaver --> NOT correct. 

Expected result:
1. men's shaver --> exact phrase match
2. men's shavers --> ZERO word distance + 1 plural
3. men's foil shaver --> 1 word distance
4. men's foils shaver --> 1 word distance + 1 plural
5. men's foil advanced shaver --> 2 word distance
6. norelco men's foil advanced shaver --> 2 word distance

Why did the documents with a higher word distance score higher?
Is there a problem with my stemmer or nGram settings?






--
Cheers!!
Kruti
