ES partial matching (ngram) use case

mehmet.onler
Hi everybody,
I have an index that stores book records such as:
ElasticSearch Cookbook
ElasticSearch Server
Mastering ElasticSearch
ElasticSearch

I have more than 2M records.

search cases:
search term --- expected result ---  (case)
---------------------------------------------------------
elastic cook --- ElasticSearch Cookbook --- (partial match)
ElasticSearhCookBook --- ElasticSearch Cookbook --- (no space)
ekasticsearch --- ElasticSearch  --- (typo)
etc.


My index settings and mappings are as below:

Analyzer:
I have 5 analyzers (see the _analyze example after the settings below):
edge_nGram_no_split_field: applies edge n-grams to the whole search term (min_gram=1, max_gram=15)
edge_nGram_token_field: applies edge n-grams to each token of the search term (min_gram=2, max_gram=15)
nGram_no_space_field: removes whitespace from the search term, then applies n-grams (min_gram=3, max_gram=4)
exact_field: looks for an exact match of the whole search term
token_field: looks for an exact match of each token of the search term

{
  "book": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "edgeNGramFilter": {
              "type": "edgeNGram",
              "min_gram": "2",
              "max_gram": "15"
            }
          },
          "char_filter": {
            "noSpace": {
              "type": "pattern_replace",
              "pattern": " ",
              "replacement": ""
            },
            "quotes": {
              "pattern": "'",
              "type": "pattern_replace",
              "replacement": ""
            }
          },
          "analyzer": {
            "edge_nGram_no_split_field": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "char_filter": [
                "quotes"
              ],
              "type": "custom",
              "tokenizer": "no_split_edge_nGram"
            },
            "edge_nGram_token_field": {
              "filter": [
                "lowercase",
                "asciifolding",
                "edgeNGramFilter"
              ],
              "char_filter": [
                "quotes"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "nGram_no_space_field": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "char_filter": [
                "quotes",
                "noSpace"
              ],
              "type": "custom",
              "tokenizer": "no_space_nGram"
            },
            "exact_field": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "token_field": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "char_filter": [
                "quotes"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          },
          "tokenizer": {
            "no_space_nGram": {
              "type": "nGram",
              "min_gram": "3",
              "max_gram": "4"
            },
            "no_split_edge_nGram": {
              "type": "edgeNGram",
              "min_gram": "1",
              "max_gram": "15"
            }
          }
        },
        "number_of_shards": "2",
        "number_of_replicas": "1",
        "version": {
          "created": "1030499"
        },
        "uuid": "BuWYNc9LQbeDU7GHEUeAQw"
      }
    }
  }
}
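
To sanity-check what each analyzer emits, the _analyze API can be run against the index. A minimal sketch (assuming the node is reachable on localhost:9200; Elasticsearch 1.x syntax):

curl -XGET 'localhost:9200/book/_analyze?analyzer=edge_nGram_token_field' -d 'ElasticSearch Cookbook'

With the settings above, this should return per-token edge n-grams such as "el", "ela", ... "elasticsearch" and "co", "coo", ... "cookbook" (lowercased, 2 to 15 characters). Swapping the analyzer parameter for nGram_no_space_field or edge_nGram_no_split_field shows the other token streams.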

Mapping:


{
  "book": {
    "mappings": {
      "en": {
        "properties": {
          "id": {
            "type": "string"
          },
          "name.nGramNoSpace": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "analyzer": "nGram_no_space_field"
          },
          "name.edgeNGram": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "analyzer": "edge_nGram_token_field"
          },
          "name.edgeNGramNoSplit": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "analyzer": "edge_nGram_no_split_field"
          },
          "name.exact": {
            "type": "string",
            "analyzer": "exact_field"
          },
          "name.token": {
            "type": "string",
            "norms": {
              "enabled": false
            },
            "analyzer": "token_field"
          }
        }
      }
    }
  }
}
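
For reference, a document presumably carries an id and a name value that feeds the name.* sub-fields above; a hypothetical index request (the document id is made up, index and type are taken from the mapping above):

curl -XPUT 'localhost:9200/book/en/1' -d '{
  "id": "1",
  "name": "ElasticSearch Cookbook"
}'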

My query is below.

minimum_should_match is calculated with the following formula:

minimum_should_match = length of search term * 0.25

{
  "bool" : {
    "must" : {
      "match" : {
        "name" : {
          "query" : "elastic cook",
          "type" : "phrase",
          "operator" : "OR",
          "fuzziness" : "1",
          "max_expansions" : 4,
          "minimum_should_match" : "2",
          "cutoff_frequency" : 0.01
        }
      }
    },
    "should" : [ {
      "match" : {
        "name.exact" : {
          "query" : "elastic cook",
          "type" : "phrase",
          "boost" : 4.0
        }
      }
    }, {
      "match" : {
        "name.token" : {
          "query" : "elastic cook",
          "type" : "phrase"
        }
      }
    }, {
      "match" : {
        "name.edgeNGramNoSplit" : {
          "query" : "elastic cook",
          "type" : "phrase",
          "boost" : 4.0,
          "fuzziness" : "1",
          "max_expansions" : 8
        }
      }
    }, {
      "match" : {
        "name.edgeNGram" : {
          "query" : "elastic cook",
          "type" : "phrase",
          "fuzziness" : "1",
          "max_expansions" : 4
        }
      }
    } ]
  }
}
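
For completeness, the bool clause above is sent as the "query" part of an ordinary search request; a minimal sketch, with the URL assumed from the index and type above:

curl -XPOST 'localhost:9200/book/en/_search' -d '{
  "query": {
    "bool": { ...the bool clause shown above... }
  }
}'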

With these settings and this query, response times are approximately as below:

length of search term -- response time (ms)
 3 -- 120
 4 -- 130
 5 -- 140
 6 -- 150
 7 -- 165
 8 -- 195
 9 -- 225
10 -- 270
11 -- 350
12 -- 400
13 -- 450
14 -- 600
15 -- 700

As I mentioned, I have more than 2M records. Are these response times normal?
Or am I doing something wrong?