protect some words when tokenizing

protect some words when tokenizing

Pierre de Soyres-2
Hi,

I use the standard tokenizer for some fields, and I would like to know if there is a way to prevent strings such as "c++", "c#" or ".net" from being tokenized as "c", "c" or "net", keeping them unmodified instead.
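For example (against a local node):

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty=1' -d 'I write c++, c# and .net code.'

The standard tokenizer drops the '+', '#' and '.' characters, so I only get "c", "c" and "net".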

Thanks in advance

Pierre


Re: protect some words when tokenizing

egaumer
You could use a whitespace tokenizer instead to preserve punctuation on this field...

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty=1' -d 'I write C++ code.'




Re: protect some words when tokenizing

Pierre De Soyres
Thank you for the response, but using 'whitespace' is not an option for me because I need commas, dots, dashes, etc. to be delimiters as well.

Pierre.


Re: protect some words when tokenizing

egaumer
You should be able to use a custom-tuned word_delimiter filter to clean up unwanted punctuation...

curl -XPUT 'http://localhost:9200/test' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        },
        "analysis" : {
            "filter" : {
                "my_delimiter" : {
                    "type" : "word_delimiter",
                    "split_on_numerics" : true,
                    "split_on_case_change" : true,
                    "catenate_numbers" : true,
                    "generate_word_parts" : true,
                    "protected_words" : ["C++", "C#"]
                }
            }
        }
    }
}'

curl -XGET 'localhost:9200/test/_analyze?tokenizer=whitespace&filters=my_delimiter&pretty=1' -d 'Hello, I write C++ code for wi-fi.'

Test that out and see if it does what you need. You can tweak other settings on the word_delimiter filter to meet your needs.

Re: protect some words when tokenizing

Pierre De Soyres
Thank you, this fits my needs.


Re: protect some words when tokenizing

Ivan Brusic
If you know the list of keywords to protect, you can also use a Keyword Marker Token Filter.
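A minimal sketch (the index and filter names are just for illustration):

curl -XPUT 'http://localhost:9200/test2' -d '{
    "settings" : {
        "analysis" : {
            "filter" : {
                "my_keywords" : {
                    "type" : "keyword_marker",
                    "keywords" : ["c++", "c#", ".net"]
                }
            }
        }
    }
}'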




Re: protect some words when tokenizing

simonw-2
Ivan, unfortunately the KeywordMarkerFilter only works in combination with stemmers. I added the keyword attribute years ago to prevent stemmers from running the stemming algorithm on terms that are known to be names, etc. I don't think this would help here.
In general I would recommend using a simple tokenizer like whitespace and then a synonym filter to transform these kinds of tokens (c++ / c#) into text representations (cPLUSPLUS / CSHARP). Once you have done that mapping you can go wild with WordDelimiterFilter etc.
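Something along these lines (the names and mappings are just for illustration, untested):

curl -XPUT 'http://localhost:9200/test3' -d '{
    "settings" : {
        "analysis" : {
            "filter" : {
                "code_synonyms" : {
                    "type" : "synonym",
                    "synonyms" : ["c++ => cplusplus", "c# => csharp"]
                }
            },
            "analyzer" : {
                "code_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase", "code_synonyms", "word_delimiter"]
                }
            }
        }
    }
}'

The filter order matters here: lowercase first so "C++" matches the rule, and the synonym filter before word_delimiter so the c++ / c# tokens are rewritten before any punctuation stripping happens.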

simon


Re: protect some words when tokenizing

Ivan Brusic
My mistake! I read the word "protect" and thought of the keyword marker filter. I once wrote a custom token filter on a Lucene project, not related to stemming, that used the keyword attribute. It is a useful attribute, but it applies post-tokenization and is not what the OP is looking for.

Nowadays in Lucene I use a pattern tokenizer since the whitespace tokenizer is too lenient, plus a word_delimiter filter (and stemmer overrides).
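Something along these lines (the pattern and the names are just for illustration, not my exact config):

curl -XPUT 'http://localhost:9200/test4' -d '{
    "settings" : {
        "analysis" : {
            "tokenizer" : {
                "my_pattern" : {
                    "type" : "pattern",
                    "pattern" : "[\\s,;:]+"
                }
            },
            "filter" : {
                "my_delimiter" : {
                    "type" : "word_delimiter",
                    "protected_words" : ["c++", "c#", ".net"]
                }
            },
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "my_pattern",
                    "filter" : ["lowercase", "my_delimiter"]
                }
            }
        }
    }
}'

curl -XGET 'localhost:9200/test4/_analyze?analyzer=my_analyzer&pretty=1' -d 'Hello, I write c++ and .net code.'

The pattern handles the obvious separators, and the word_delimiter filter cleans up whatever punctuation slips through, except for the protected terms.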

-- 
Ivan

