|
Hello,
I was reading this group's posts and there seem to be two schools of thought on ngram use:

1. Index with an ngram-enabled analyzer but search with an analyzer without ngrams, so that complete search terms are matched against ngrams.
2. Index with ngrams and search with ngrams.

My understanding is:

#1 will require very long ngrams; there will be very few (one?) term matches per document, and the longer/rarer the matched ngram, the better the match. It essentially generates tons of "synonyms" (ngrams) for your searched field and matches your terms against them. One problem is that the ngram length should essentially be longer than the longest word. That seems to be an issue: while a handful of characters is often enough to identify the document (think auto-complete scenario), providing a search token longer than the max ngram length will return no hits.

#2 will need short ngrams, 3-5 characters at most, and will match the n-grammed search term against the n-grammed field in the index. The more matches, the better the score. The precision is probably not as good as #1, so it would need to be combined with a search on the original field and maybe a shingled field. But it will potentially handle simple typos.

I have two use cases (both to be used in auto-complete pick lists):

1. A long identifier (contract number), 10-30 characters, which needs to be searched on any part of it.
2. A company name, which needs to be searched on individual words from the start of the words (could use phrase prefix query or edgeNgram).

Could you please share your opinion about #1 and #2 (and any other techniques you have used) and their applicability to my cases?

Thank you,
Alex

You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
The general approach is to index ngrams in a separate field and then craft a query that searches on both fields but boosts matches on the non ngram field. This way you match on partial words (ngrams) but favor matches on whole tokens. This is generally where DisMax is useful because the query plays an important role in fine tuning the relevance.
-Eric
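As a rough illustration of the two-field approach described above, the sketch below builds a dis_max query body as a plain Python dict. The field names `name` and `name.ngram`, the boost of 3.0, and the tie_breaker of 0.3 are all assumptions made up for the example, not values from this thread; they would need tuning against real data.

```python
def build_autocomplete_query(text, whole_field="name", ngram_field="name.ngram",
                             whole_boost=3.0, tie_breaker=0.3):
    """Build a dis_max query body that favors whole-token matches.

    All field names and numeric values are hypothetical defaults.
    """
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": [
                    # Whole-token field: boosted so complete words rank first.
                    {"match": {whole_field: {"query": text, "boost": whole_boost}}},
                    # N-gram field: lets partial words match, at normal weight.
                    {"match": {ngram_field: {"query": text}}},
                ],
            }
        }
    }
```

The resulting dict could be serialized to JSON and sent as a search request body. The boost and tie_breaker are the knobs for trading off partial matches against whole-token matches.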
|
Thank you, Eric. I understand that, but you can use them in two ways, as per my post.
|
For autocomplete I typically use:

- whitespace tokenizer
- word delimiter token filter
- edge-ngram token filter

At query time, I do not perform the edge-ngrams. This approach will work for your 2nd use case, but your first use case is kind of tricky. I would index that field twice. The first field would use:

- keyword tokenizer
- edge ngram

The 2nd field would use:

- keyword tokenizer
- reverse token filter
- edge ngram

Again, skip the edge-ngrams at query time. This will allow prefix matching and suffix matching on your contract number. For a contract number of 12345, you will get it as a suggestion for queries of 12 or 345.

Hope this helps.

Thanks,
Matt Weber
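The two-field setup can be simulated in plain Python to see which queries it covers. This is only a sketch of the idea, not Elasticsearch code: `edge_ngrams`, `index_terms`, and `suggests` are hypothetical helper names, the gram sizes are assumed, and the query against the reversed field is assumed to be reversed too (but not edge-ngrammed).

```python
def edge_ngrams(token, min_gram=1, max_gram=30):
    """Return the leading substrings of token, from min_gram to max_gram chars."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(contract_number):
    """Simulate the two indexed fields: forward and reversed edge n-grams."""
    forward = set(edge_ngrams(contract_number))
    backward = set(edge_ngrams(contract_number[::-1]))
    return forward, backward


def suggests(contract_number, query):
    """True if the whole query matches a term in either simulated field."""
    forward, backward = index_terms(contract_number)
    # The query is not n-grammed; for the reversed field it is only reversed.
    return query in forward or query[::-1] in backward
```

With these helpers, `suggests("12345", "12")` and `suggests("12345", "345")` both hold, while `suggests("12345", "234")` does not, since an internal substring is neither a prefix nor a suffix.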
|
In reply to this post by AlexR
Doing ngram analysis on the query side will usually introduce a lot of noise (i.e., relevance is bad).
The problem with auto-suggest is that it's hard to get relevance tuned just right because you're usually matching against very small text fragments. At the same time, relevance is really subjective, making it hard to measure with any real accuracy. Doing ngram analysis on the query side exacerbates the problem in my experience. With that said, use cases differ, as does the quality of the data driving the auto-suggest, and that can cause your mileage to vary. If I had doubts, I'd just test both cases against my actual data and requirements. That'll provide a more definitive answer.
-Eric
|
In reply to this post by Matt Weber-2
Thanks Matt!
That is what I was going to do before I found an older thread about using short ngrams at both indexing and searching. I was intrigued whether it would work well. Ed's experience (the post below) is that it does not work very well. I am going to try it just to have some first-hand experience, but I am pretty sure the approach you outlined is the one I will go with.

One question I have is whether you index direct and reverse edge ngrams into the same field or two separate ones, particularly as it relates to highlighting. Will highlighting work if I index both into the same field?

Thanks,
Alex
|
In reply to this post by egaumer
Thanks Ed. I suspected as much. But as you suggested, I will do a quick test. Maybe the nature of the data (the beginning of the alphanumeric contract number is usually a good deal less unique than the end) will make it work well.
|
It will need to be two fields, one normal, one reverse. You are going to need to experiment with highlighting... I have a feeling that is going to give you some mixed results.

BTW, the other poster is Eric, not Ed. :)
|
Hi there,
Interesting, I was experimenting with a very similar use case (search suggestions on a [possibly] short list of one-to-few-word codes) with highlighting. It seems to be working fine and I can share more details if you are interested (though I would like to check a couple of details first to make sure it is not buggy). My only concern is that my approach would not scale well for large data (I am not using edgeNGrams but nGrams).
Regards, Lukas
|
In reply to this post by AlexR
Sorry Eric :-( I was talking to Ed at the time and got the names confused....
|
No worries, I've been called much worse ;-)
-Eric
|
In reply to this post by Lukáš Vlček
Hi Lukas,
It will be very interesting to compare notes. I will be out of town for a few days and may not be able to conclude my test, so let's touch base next week if that's OK with you.

Alex
|
In reply to this post by Matt Weber-2
Matt,
My understanding is that prefix and suffix edge ngrams will only deal with searching on a prefix or suffix, but not on an internal substring of my contract number. I think I have to go with short ngrams at index and search time and use a match query with "and" to ensure a precise (almost) match.

BTW, with suffix ngrams I do not think there is a need for reverse filters, as edge ngrams can be applied from the end of the word. As I understand it, reversing (at index and search time) is needed to get the back-aligned ngrams, but since they are supported directly there is no need to use it.
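To make the "short ngrams at index and search time, match with and" idea concrete, here is a small pure-Python simulation. It is a sketch under assumptions: trigrams are used for both sides, `ngrams` and `match_and` are hypothetical helper names, and term positions are ignored, which is exactly why the match is only "almost" precise.

```python
def ngrams(text, n=3):
    """Return every contiguous n-character substring of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def match_and(document, query, n=3):
    """Simulate a match query with the "and" operator over n-grammed text.

    The document matches only if every n-gram of the query is among the
    document's n-grams. Positions are ignored, so this is not a true
    substring test.
    """
    doc_terms = set(ngrams(document, n))
    query_terms = ngrams(query, n)
    # A query shorter than n produces no terms and therefore cannot match.
    return bool(query_terms) and all(term in doc_terms for term in query_terms)
```

An internal substring now matches: `match_and("CN-2013-00451", "013-0")` holds. A false hit is also possible: `match_and("ABCXBCD", "ABCD")` holds because both `ABC` and `BCD` occur in the document, even though `ABCD` itself does not.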
|
In reply to this post by Lukáš Vlček
Hi Lukas,
I did a bit of testing and I could see several approaches for autocomplete-style search in ANY part of a long identifier string (i.e. contract number):

1. Indexing and searching with short ngrams, and searching using match with an "and" condition.
2. Searching by prefix or suffix (not any part): indexing twice with start and back edge ngrams and searching using a term query on un-ngrammed criteria.
3. Using long ngrams (say from 3 to 40 characters in my case), longer than the maximum length of an indexed contract number, and searching with un-ngrammed criteria.

#1 works fine and highlights well. There may possibly be some false hits, but it should be pretty accurate in my case of searching contract numbers.

#2 works fine but is limited to prefix/suffix searches.

#3 searches fine, but highlighting is very erratic: sometimes it highlights the hits and sometimes it does not. Looks like a bug to me unless I am missing something.

Another option is to do back (reverse) edge ngrams and do a prefix search on the result. I have not tried it, but it should probably work well; not sure about highlighting, though.

Would you share your findings?

Thank you,
Alex
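One way to compare these approaches is by how many index terms each produces per contract number. The helper below is a back-of-the-envelope sketch; `ngram_count` is a hypothetical name, and it counts n-gram positions before deduplication, so real term counts could be lower.

```python
def ngram_count(length, min_gram, max_gram):
    """Count the n-grams produced from one token of the given length.

    For each gram size n there are (length - n + 1) positions; sizes larger
    than the token itself produce nothing.
    """
    upper = min(max_gram, length)
    return sum(length - n + 1 for n in range(min_gram, upper + 1))
```

For a 30-character contract number, trigrams alone give `ngram_count(30, 3, 3)`, which is 28 terms, while the 3-to-40 range of approach 3 gives `ngram_count(30, 3, 40)`, which is 406 terms. That growth is roughly quadratic in the identifier length, which may be relevant to the scaling concern raised earlier in the thread.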
|
Tested searching using A) match/and with short (3 char) ngrams (index and search time) vs B) using reverse (back aligned) edge ngrams with prefix query.
My search is in strings (contract numbers) of about 20-30 characters, and when my search string is short (4-6 chars), A runs about 2-3 times faster than B. When I plug in 15-20 characters as my search string, A and B run at about the same speed.
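For reference, approach B can be modeled in a few lines of Python: back-aligned edge ngrams are the suffixes of the identifier, and a prefix query then finds any substring, because every substring is a prefix of some suffix. This is a simplified sketch, not how Elasticsearch or Lucene is implemented; the sorted list stands in for a term dictionary, and all function names and sample identifiers are hypothetical.

```python
from bisect import bisect_left


def back_edge_ngrams(text, min_gram=1):
    """Return every suffix of text that is at least min_gram characters long."""
    return [text[i:] for i in range(len(text) - min_gram + 1)]


def build_term_index(documents):
    """Build a sorted list of (term, document) pairs as a toy term dictionary."""
    return sorted((term, doc) for doc in documents for term in back_edge_ngrams(doc))


def prefix_search(index, prefix):
    """Return the documents having at least one term that starts with prefix."""
    # (prefix,) sorts before any (prefix..., doc) pair, so this is the first candidate.
    start = bisect_left(index, (prefix,))
    hits = set()
    for term, doc in index[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        hits.add(doc)
    return hits
```

In this model a shorter prefix walks over more terms before the loop stops. That is one possible reason short search strings could be slower with B, though it is only a guess and was not verified in this thread.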
|
Can you post a gist of a sample mapping/ sample query?
|
Here it is
|
Interesting. I will play around with it and will post back if I can find a (fast) way to get rid of the false positives! Thanks for sharing.
Which one did you end up using? Or are you still in the research phase?
|
Still prototyping. For now I use a prefix query on back-aligned edge ngrams. I will experiment some more. I wonder how match/phrase works against short ngrams vs. match/and.
Please share your findings.
