|
Hi
We are discussing building an index where possible misspellings at the end of a word are getting hits.

We were looking at using the EdgeNGram and making ngrams of the last two characters, but that gives us an index of just the 2-character variations of the word endings.

How would we best do this? Is it possible to configure the inverse of that? Should we tokenize it with a regexp? Any other ideas?
-- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
On Tue, 2013-02-26 at 02:45 -0800, Per Ekman wrote:
> Hi
>
> We are discussing building an index where possible misspellings at the
> end of a word are getting hits.
>
> We were looking at using the EdgeNGram and making ngrams of the last
> two characters, but that gives us an index of just the 2-character
> variations of the word endings.
>
> How would we best do this? Is it possible to configure the inverse of
> that? Should we tokenize it with a regexp? Any other ideas?

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "end_grams" : {
               "max_gram" : 2,
               "side" : "back",
               "min_gram" : 2,
               "type" : "edge_ngram"
            }
         },
         "analyzer" : {
            "end_grams" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "stop",
                  "end_grams"
               ],
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+quick+brown+fox+jumped+over+the+lazy+dog&analyzer=end_grams'

# {
#    "tokens" : [
#       {
#          "end_offset" : 9,
#          "position" : 1,
#          "start_offset" : 7,
#          "type" : "word",
#          "token" : "ck"
#       },
#       {
#          "end_offset" : 15,
#          "position" : 2,
#          "start_offset" : 13,
#          "type" : "word",
#          "token" : "wn"
#       },
#       {
#          "end_offset" : 19,
#          "position" : 3,
#          "start_offset" : 17,
#          "type" : "word",
#          "token" : "ox"
#       },
#       {
#          "end_offset" : 26,
#          "position" : 4,
#          "start_offset" : 24,
#          "type" : "word",
#          "token" : "ed"
#       },
#       {
#          "end_offset" : 31,
#          "position" : 5,
#          "start_offset" : 29,
#          "type" : "word",
#          "token" : "er"
#       },
#       {
#          "end_offset" : 40,
#          "position" : 6,
#          "start_offset" : 38,
#          "type" : "word",
#          "token" : "zy"
#       },
#       {
#          "end_offset" : 44,
#          "position" : 7,
#          "start_offset" : 42,
#          "type" : "word",
#          "token" : "og"
#       }
#    ]
# }

clint
|
|
Alright, that is pretty much what we've done so far, but I'm looking at getting "bro", "f", "jump", ... into the index instead of the endings, and possibly the original words as well.
On Tue, Feb 26, 2013 at 12:02 PM, Clinton Gormley <[hidden email]> wrote:
|
|
I guess I was really unclear in my original text. I want to know how to strip the last couple of characters in a word, and also keep the original.
On Tuesday, February 26, 2013 12:09:19 PM UTC+1, Per Ekman wrote:
|
|
On Tue, 2013-02-26 at 12:09 +0100, Per Ekman wrote:
> Alright, that is pretty much what we've done so far, but I'm looking
> at getting "bro", "f", "jump"..... into the index, instead of the
> endings,

You specified that you wanted ngrams of the last two characters, which
is why I set "side" to "back".

> And possibly the original words as well.

Just make the edge ngrams long enough.

You may want to use a multi-field to have one field indexed with (eg)
the standard analyzer, and another indexed with edge-ngrams, and you
can query both of them in a single query, giving different boosts to
each clause.

clint
|
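To make the "long enough" suggestion concrete, here is a minimal Python sketch that mimics, outside Elasticsearch, what a front-side edge n-gram filter would be expected to emit. The function name `front_edge_ngrams` and the default `min_gram`/`max_gram` values are assumptions chosen for illustration; this is a plain-Python stand-in, not the Lucene implementation.

```python
def front_edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of `word` whose length lies in [min_gram, max_gram].

    Illustrative stand-in for an edge n-gram filter working from the front
    of the token; the real filter runs inside the analysis chain.
    """
    # Never ask for a prefix longer than the word itself.
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


if __name__ == "__main__":
    for word in ["brown", "fox", "jumped"]:
        print(word, "->", front_edge_ngrams(word))
```

As long as `max_gram` is at least the length of the word, the list contains the full word as well as truncated forms such as "bro", which is the sense in which longer grams also cover the original words.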
|
On Tue, 2013-02-26 at 03:13 -0800, Per Ekman wrote:
> I guess I was really unclear in my original text. I want to know how
> to strip the last couple of characters in a word, and also keep the
> original

Ah right

Currently you can't do that in the same field - you can have one field
with the full word, and another field which uses the pattern tokenizer
to drop the last two letters.

I'm hoping to get a token filter accepted which does allow multiple
captures per position in the same field:

https://issues.apache.org/jira/browse/LUCENE-4766

but it'll be a while before that happens

clint
|
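The two-field idea can be sketched in plain Python: one token list holds the full words and a second holds each word with its last two letters dropped. The regular expression below is only an assumption about the kind of pattern one might hand to the pattern tokenizer (it captures everything except the final two word characters). Python's `re` module is a stand-in for the Java regex engine that Elasticsearch uses, and the helper names are hypothetical.

```python
import re

# Hypothetical pattern: group 1 is the word minus its last two characters.
DROP_LAST_TWO = re.compile(r"(\w+)\w{2}")
WORD = re.compile(r"\w+")


def full_tokens(text):
    """Tokens for the 'full word' field: lowercased whole words."""
    return WORD.findall(text.lower())


def truncated_tokens(text):
    """Tokens for the second field: each word without its last two letters.

    Words shorter than three characters cannot match and are skipped.
    """
    return DROP_LAST_TWO.findall(text.lower())


if __name__ == "__main__":
    sentence = "quick brown fox jumped"
    print(full_tokens(sentence))
    print(truncated_tokens(sentence))
```

In an actual index the pattern would presumably be given to the pattern tokenizer along with a setting that selects the capture group, and the two token streams would live in separate fields; the exact option names are worth checking against the documentation for your Elasticsearch version.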
|
Cool. Yeah, we were playing around with the pattern tokenizer to achieve this.

On Wed, Feb 27, 2013 at 3:17 PM, Clinton Gormley <[hidden email]> wrote:
|
