|
Hi,
I use the standard tokenizer for some fields, and I would like to know if there is a way to prevent strings such as "c++", "c#" or ".net" from being tokenized as "c", "c" or "net", so that they are kept unmodified.
Thanks in advance,
Pierre
-- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
You could use a whitespace tokenizer instead to preserve punctuation on this field...
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty=1' -d 'I write C++ code.'
On Tuesday, February 12, 2013 8:19:57 AM UTC-5, Pierre de Soyres wrote: Hi, |
|
Thank you for your response,
but using 'whitespace' is not an option for me because I need comma, dot, dash, etc. to be delimiters as well.
Pierre
On Tuesday, February 12, 2013 14:47:23 UTC+1, egaumer wrote: You could use a whitespace tokenizer instead to preserve punctuation on this field... |
|
You should be able to use a custom-tuned word_delimiter filter to clean up unwanted punctuation...
curl -XPUT 'http://localhost:9200/test' -d '{
  "settings" : {
    "index" : { "number_of_shards" : 1, "number_of_replicas" : 1 },
    "analysis" : {
      "filter" : {
        "my_delimiter" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : true,
          "catenate_numbers" : true,
          "generate_word_parts" : true,
          "protected_words" : ["C++", "C#"]
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/test/_analyze?tokenizer=whitespace&filters=my_delimiter&pretty=1' -d 'Hello, I write C++ code for wi-fi.'
Test that out and see if it does what you need. You can tweak other settings on the word_delimiter filter to meet your needs.
On Tuesday, February 12, 2013 9:09:10 AM UTC-5, Pierre De Soyres wrote: Thank you for response, |
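As a rough illustration of the protected-words idea in the reply above, here is a small pure-Python sketch. It is not Elasticsearch or Lucene code: the `analyze` function and the `PROTECTED` set are hypothetical names, and the splitting rule is a simplification that ignores case-change and numeric splitting.

```python
import re

# Hypothetical stand-in for the "protected_words" setting (compared case-sensitively here).
PROTECTED = {"C++", "C#"}

def analyze(text, protected=PROTECTED):
    """Whitespace-tokenize, then split unprotected tokens on non-alphanumerics."""
    tokens = []
    for raw in text.split():
        if raw in protected:
            # Protected tokens pass through untouched.
            tokens.append(raw)
        else:
            # Keep runs of ASCII letters/digits; drop the punctuation between them.
            tokens.extend(re.findall(r"[A-Za-z0-9]+", raw))
    return tokens

print(analyze("Hello, I write C++ code for wi-fi."))
# ['Hello', 'I', 'write', 'C++', 'code', 'for', 'wi', 'fi']
```

Note that in this sketch the protected check is made on the whole whitespace-delimited token, so "C++," with a trailing comma would not be protected and would be reduced to "C".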
|
thank you, this fits my needs
On Tuesday, February 12, 2013 16:11:37 UTC+1, egaumer wrote: You should be able to use a custom tuned word_delimiter to clean up unwanted punctuation... |
|
If you know the list of keywords to protect, you can also use a Keyword Marker Token Filter.
On Tue, Feb 12, 2013 at 7:44 AM, Pierre De Soyres <[hidden email]> wrote: thank you, this fits my needs |
|
Ivan, unfortunately the keywordMarkerFilter only works in combination with stemmers. I added the keyword attribute years ago to prevent some stemmers from running the stemming algorithm on terms that are known to be names etc. I don't think this would help here.
In general I would recommend using a simple tokenizer like whitespace and then a synonym filter to transform these kinds of tokens (c++ / c#) into text representations (cPLUSPLUS / CSHARP). Once you have done this mapping, you can go wild with WordDelimiterFilter etc.
simon
On Friday, February 15, 2013 5:19:23 PM UTC+1, Ivan Brusic wrote: |
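The mapping approach described in the reply above can be sketched in a few lines of Python. This is only an emulation of the idea, not the synonym filter or WordDelimiterFilter themselves: the `SYNONYMS` table, the `analyze` function, the lowercasing step, and the ".net" entry are assumptions added for illustration.

```python
import re

# Hypothetical mapping that plays the role of synonym rules such as "c++ => cplusplus".
SYNONYMS = {"c++": "cplusplus", "c#": "csharp", ".net": "dotnet"}

def analyze(text):
    """Whitespace-tokenize, map special tokens to plain words, then split on punctuation."""
    # Stage 1: whitespace tokenizer plus lowercasing, then the synonym-style mapping.
    mapped = [SYNONYMS.get(tok, tok) for tok in text.lower().split()]
    # Stage 2: a crude delimiter step; mapped tokens are purely alphanumeric, so they survive.
    tokens = []
    for tok in mapped:
        tokens.extend(re.findall(r"[a-z0-9]+", tok))
    return tokens

print(analyze("I use C++ and .NET but not C#"))
# ['i', 'use', 'cplusplus', 'and', 'dotnet', 'but', 'not', 'csharp']
```

The same mapping would have to be applied to queries so that a search for "c++" is also rewritten to "cplusplus". In this sketch a token with attached punctuation, such as ".net,", misses the mapping and falls back to "net".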
|
My mistake! I read the word "protect" and thought of the keyword marker filter. I once wrote a custom token filter on a Lucene project I was on, not related to stemming, that used the keyword attribute. It is a useful attribute, but it is applied post-tokenization and is not what the OP is looking for. Nowadays in Lucene I use a pattern tokenizer, since the whitespace tokenizer is too lenient, plus a word_delimiter filter (and stemmer overrides).
--
Ivan
On Sat, Feb 16, 2013 at 7:15 AM, simonw <[hidden email]> wrote: Ivan, unfortunately the keywordMarkerFilter only works in combination with stemmers. I added the keyword attribute years ago to prevent some stemmers from running the stemming alg on terms that are known to be names etc. I don't think this would help here. |
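To make the pattern-tokenizer idea above more concrete, here is a minimal Python sketch that splits on a delimiter pattern instead of on whitespace alone. The actual pattern used in the reply is not given in the thread, so the `DELIMITERS` regex and the `tokenize` function are purely hypothetical: they treat whitespace, commas, dashes and similar characters as delimiters, and a dot only when it ends a word.

```python
import re

# Hypothetical delimiter pattern (not from the thread): whitespace, common punctuation
# and dashes always split; a dot splits only when followed by whitespace or end of text.
DELIMITERS = re.compile(r"(?:[\s,;:!?()\-]|\.(?=\s|$))+")

def tokenize(text):
    """Split on the delimiter pattern and drop empty fragments."""
    return [tok for tok in DELIMITERS.split(text) if tok]

print(tokenize("I write C++, C# and .net code for wi-fi."))
# ['I', 'write', 'C++', 'C#', 'and', '.net', 'code', 'for', 'wi', 'fi']
```

Because "+", "#" and a leading dot are not in the delimiter set, "C++", "C#" and ".net" survive, while commas, dashes and sentence-ending dots still separate tokens. A dot inside a word (for example in a host name) is left alone by this pattern, which may or may not be what a given field needs.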
