
Phrase matching using query_string on nGram analyzed data


Mike
I have a string field that is analyzed with nGrams at index time, and I can't get phrase matching (wrapping the search text in " ") to work.  Other things like fuzzy matching with ~, combining words with && and ||, and boosting with ^ work fine though.  Am I doing something wrong, or does phrase matching not work with ngrams?

My mapping:
                "properties" : {                                                                  
                    "myquery"      : {                                                              
                        "type" : "multi_field",                                                   
                        "fields" : {                                                              
                            "myquery"          : { "type" : "string", "index_analyzer" : "myAnalyzer", "search_analyzer" : "myAnalyzer2" },     
                            "myqueryUntouched" : { "type" : "string", "index" : "not_analyzed" }         
                        }                                                                         
                    },
                    ...                

My settings:
            "analysis" : {                                                                        
                "analyzer" : {                                                                    
                    "myAnalyzer" : {                                                              
                        "tokenizer" : "standard",                                                 
                        "filter" : ["standard", "lowercase", "stop", "myNGram"]                   
                    },                                                                            
                    "myAnalyzer2" : {                                                             
                        "tokenizer" : "standard",                                                 
                        "filter" : ["standard", "lowercase", "stop"]                              
                    }                                                                             
                },                                                                                
                "filter" : {                                                                      
                    "myNGram" : {                                                                 
                        "type" : "nGram",                                                         
                        "min_gram" : 1,                                                           
                        "max_gram" : 8                                                            
                    }                                                                             
                }                                                                                 
                                                           
My query:
"query":{
    "query_string":{
        "default_field":"myquery",
        "default_operator":"AND",
        "query":"\"ibm eps\""
    }
}


If I remove the escaped " ", I get many results as I expect, like:
ibm eps
ibm q2 eps
ibm 2001 eps

If someone wraps the search text in " ", though, I want only the "ibm eps" results.


Re: Phrase matching using query_string on nGram analyzed data

Clinton Gormley-2
Hi Mike

On Fri, 2012-09-14 at 15:11 -0700, Mike wrote:
> I have my string field index_analyzed with nGrams, and I can't seem to
> get phrase matching using " " in my search text to work.  Other things
> like fuzzy matching with ~, combining words with && and ||, boosting
> with ^ work fine though.  Am I doing something wrong, or does phrase
> matching not work with ngrams?

Phrase matching does work with ngrams, but there is a long-standing bug
in the edge-ngram analyzer in Lucene, which assigns token positions
differently from the standard tokenizer.

So if you analyze the field with edge-ngrams and you do a phrase-search
on the field using the SAME analyzer, then it will work.  But you are
using the standard tokenizer at search time, not the edge-ngram
tokenizer.

clint
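
To make the position mismatch concrete, here is a rough Python model of it. This is not Lucene code: it assumes, as described above, that the ngram filter gives every gram its own position, while the standard analyzer emits one token per word. The function names are invented for this sketch.

```python
def ngram_filter(words, min_gram=1, max_gram=8):
    """Simplified model of an nGram filter: every gram advances the position by one."""
    tokens = []
    position = 0
    for word in words:
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            for start in range(len(word) - size + 1):
                tokens.append((word[start:start + size], position))
                position += 1
    return tokens


def standard(words):
    """Simplified model of the search analyzer: one token per word."""
    return [(word, position) for position, word in enumerate(words)]


def phrase_matches(indexed, query):
    """True if all query tokens occur in the index at the same relative positions."""
    postings = {}
    for term, pos in indexed:
        postings.setdefault(term, set()).add(pos)
    first_term, first_pos = query[0]
    for start in postings.get(first_term, ()):
        offset = start - first_pos
        if all(pos + offset in postings.get(term, ()) for term, pos in query):
            return True
    return False


doc = ngram_filter(["ibm", "eps"])

# Different analyzers: "ibm" sits at position 5 and "eps" at 11, not 5 and 6.
print(phrase_matches(doc, standard(["ibm", "eps"])))      # False

# Same analyzer on both sides: the positions line up.
print(phrase_matches(doc, ngram_filter(["ibm", "eps"])))  # True
```

In this model a document containing "ibm q2 eps" does not match the ngram-analyzed phrase "ibm eps", which is the behaviour asked for in the original question.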




Re: Phrase matching using query_string on nGram analyzed data

Mike
Thanks for the response Clint!  I assume what you said applies to both the edge-nGram and regular nGram filters, since I am only using the regular nGram filter in my index analyzer.
 
You mentioned that I should use the ngram tokenizer rather than the standard tokenizer. Does this mean that I should not use the ngram filter?  I was hoping to get partial search matches, which is why I apply the ngram filter only at index time and not at query time as well (a search for "national" should match "international").
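
For reference, the partial matching described here can be sketched in a few lines of Python. This is only a model of what an ngram filter with min_gram 1 and max_gram 8 would put in the index for one word; the `ngrams` helper is made up for illustration and is not an Elasticsearch API.

```python
def ngrams(word, min_gram=1, max_gram=8):
    """All substrings of `word` whose length is between min_gram and max_gram."""
    return {
        word[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(word) - size + 1)
    }


indexed_terms = ngrams("international")

# The un-grammed search term "national" is one of the indexed grams.
print("national" in indexed_terms)       # True

# The full word is longer than max_gram, so it is not among the grams.
print("international" in indexed_terms)  # False
```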


Re: Phrase matching using query_string on nGram analyzed data

Clinton Gormley-2
On Mon, 2012-09-17 at 07:40 -0700, Mike wrote:
>         Thanks for the response Clint!  I assume what you said applies
>         to both the edge-nGram and regular nGram filters, since I am
>         only using the regular nGrams filter in my index analyzer.  

Yes, it affects the ngrams as well:

https://issues.apache.org/jira/browse/LUCENE-1224

>  
>         You mentioned that I should use the ngram tokenizer not the
>         standard tokenizer, does this mean that I should not use the
>         ngram filter?  I was hoping to get partial search matches,
>         which is why I used the ngram filter only during index time
>         and not during query time as well (national should find a
>         match with international).

No, you can use the ngram tokenizer or token filter.  The important
thing is to use the same analyzer at index and search time.  This is
almost a golden rule, unless you really understand what you're doing.

clint
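
As an illustration of that rule, the mapping from the first post could be built with a single analyzer name used on both sides. The sketch below is Python that only assembles and prints the JSON. The `field_mapping` helper is hypothetical, the field and analyzer names are taken from the thread, and whether this still gives the partial matching Mike wants would need testing.

```python
import json


def field_mapping(analyzer):
    """Mapping for a string field that uses one analyzer for indexing and searching."""
    return {
        "type": "string",
        "index_analyzer": analyzer,
        "search_analyzer": analyzer,
    }


mapping = {
    "properties": {
        "myquery": {
            "type": "multi_field",
            "fields": {
                # Same analyzer at index and search time, per the advice above.
                "myquery": field_mapping("myAnalyzer"),
                "myqueryUntouched": {"type": "string", "index": "not_analyzed"},
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```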


Re: Phrase matching using query_string on nGram analyzed data

trex
Hello, Clinton Gormley-2.

I use the edge ngram analyzer at index and search time. Nevertheless, I can't get any results when I try to match a phrase. What am I doing wrong?


My query:
{
  "query":{
    "multi_match":{
      "query":"dementia in alz",
      "type":"phrase",
      "analyzer":"edge_ngram_analyzer",
      "fields":["_all"]
    }
  }
}

My mappings:
...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...