Special Characters not indexed and hence not searchable


Praveen Kariyanahalli
I have seen many threads discussing this. I have cleared a few hurdles, but the last one is still bothering me. My data contains email addresses and the characters ":", "/", and "-", all of which need to be indexed and searchable. I can now search the email address [hidden email], but I still cannot get the following characters indexed: ":", "/", and "-". Any help is greatly appreciated. Is this an issue with my index_analyzer or search_analyzer?

Thanks in Advance
-Praveen

Here is my mapping:

ESINDEX =   {
                "number_of_shards": 1,
                "analysis": {
                   "filter": {
                      "mynGram" : {
                          "type"    : "nGram",
                          "min_gram": 1,
                          "max_gram": 50
                      }
                    },
                    "analyzer": {
                        "a1" : {
                            "type"     :"custom",
                            "tokenizer":"uax_url_email",
                            "filter"   : ["mynGram"]
                        }
                    }
                }
            }


ESMAPPINGS = {
                "index_analyzer"  : "a1",
                "search_analyzer" : "whitespace",
                "properties" : {
                    u'test_field1' : {
                        'index'  : 'not_analyzed',
                        'type'   : u'string',
                        'store'  : 'yes'
                    },
                    u'testfield2' : {
                        'index'  : 'not_analyzed',
                        'type'   : u'string',
                        'store'  : 'yes'
                    },
                    u'email' : {
                        'index': 'not_analyzed',
                        'type' : u'string',
                        'store': 'yes'
                    },
:::::::::::
}
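
A useful first check for this kind of problem is the Analyze API, which shows exactly which tokens an analyzer emits. A minimal sketch (not from the original post; it assumes an index created with the ESINDEX settings above, here named "test", and Elasticsearch listening on localhost:9200):

    import json
    import urllib2

    # Ask analyzer "a1" how it tokenizes a sample string; each emitted
    # token is printed on its own line.
    url = "http://localhost:9200/test/_analyze?analyzer=a1"
    body = "user@example.com http://host:8080/path a-b"
    resp = json.load(urllib2.urlopen(urllib2.Request(url, body)))
    for t in resp["tokens"]:
        print(t["token"])

If ":" or "/" never appear in the output, the index analyzer is discarding them before the nGram filter ever runs.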

Re: Special Characters not indexed and hence not searchable

Joe Wong
Do you mean you want ":" and "/" to be searchable?
If so, you may have to use a custom tokenizer and specify a pattern to tokenize on.

This will include all Unicode characters plus special characters such as "/" and ":":

'tokenizer' : {
    'email_tokenizer' : {
        'type'    : 'pattern',
        'pattern' : "[@*\/*:*\.*\\w\\p{L}]+"
    }
}
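
As a plain-regex illustration (a Python sketch, not part of the original reply; Python's re module has no \p{L}, so \w stands in for the letter class), used as a matching pattern this character class picks out runs of the allowed characters. Note, though, as the thread works out further down, that Elasticsearch's pattern tokenizer actually uses its pattern to split the input, not to match tokens:

    import re

    # Simplified stand-in for the suggested class; matches runs of the
    # characters the tokens are meant to contain.
    token_re = re.compile(r"[@/:.\w-]+", re.UNICODE)
    print(token_re.findall("mail user@example.com at http://host:8080/x"))
    # ['mail', 'user@example.com', 'at', 'http://host:8080/x']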




Re: Special Characters not indexed and hence not searchable

Praveen Kariyanahalli
Hi Joe

As per your suggestion, I changed my tokenizer to the following, but it doesn't help, and I have lost my initial nGram indexing too. Am I missing something?

ESINDEX =   {
                "number_of_shards": 1,
                "analysis": {
                   "filter": {
                      "mynGram" : {
                          "type"    : "nGram",
                          "min_gram": 1,
                          "max_gram": 50
                      }
                    },
                    "analyzer": {
                        "a1" : {
                            "type" : "pattern",
                            "pattern" : "[@\/\:\.\\w\\p{L}]+",
                            "filter"   : ["mynGram"]
                        }
                    }
                }
            }

I did not understand the significance of the '*' in your pattern. Also, as I read it, you are matching 'letters' (\p{L}) and 'word' characters (\w), any number of times in a row (the trailing '+')?

Can you please clarify?

Thanks
-Praveen
 


Re: Special Characters not indexed and hence not searchable

Praveen Kariyanahalli
Just in case I was not clear,

I want my tokens to be anything made of letters, digits, @, -, ., :, /

Here is the pattern I am trying: "pattern" : "[@\-\.\:\/\\p{L}\\d]+"

On these tokens I need the nGram filter.

Any help is greatly appreciated.

Thanks in Advance
-pk


ESINDEX =   {
                "number_of_shards": 1,
                "analysis": {
                   "filter": {
                      "mynGram" : {
                          "type"    : "nGram",
                          "min_gram": 1,
                          "max_gram": 50
                      }
                    },
                    "analyzer": {
                        "a1" : {
                            "type" : "pattern",
                            "filter"   : ["mynGram"],
                            "pattern" : "[@\-\.\:\/\\p{L}\\d]+"
                        }
                    }
                }
            }

ESMAPPINGS = {
                "index_analyzer"  : "a1",
                "search_analyzer" : "whitespace",
                "date_formats"    : ["yyyy-MM-dd", "MM-dd-yyyy"],
                "properties" : {
                    u'my_field' : {
                        'index'  : 'not_analyzed',
                        'type'   : u'string',
                        'store'  : 'yes'
                    },
::::::::::::::
}
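
Why the nGram filter matters for this goal (an illustrative sketch, not from the original post): with min_gram set to 1, every single character of a token, including ":" and "/", becomes its own indexed term, which is what makes the bare special characters searchable.

    # Illustrative: the set of n-grams an nGram token filter with
    # min_gram=1, max_gram=50 produces (Lucene may emit them in a
    # different order).
    def ngrams(token, min_gram=1, max_gram=50):
        out = []
        for start in range(len(token)):
            for length in range(min_gram, max_gram + 1):
                if start + length > len(token):
                    break
                out.append(token[start:start + length])
        return out

    print(ngrams("a:b"))
    # ['a', 'a:', 'a:b', ':', ':b', 'b']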

Re: Special Characters not indexed and hence not searchable

Joe Wong
In reply to this post by Praveen Kariyanahalli
Ah yes, the '*' aren't required.

You need to define the custom tokenizer for your analyzer, i.e.:

ESINDEX =   {
                "number_of_shards": 1,
                "analysis": {
                   "filter": {
                      "mynGram" : {
                          "type"    : "nGram",
                          "min_gram": 1,
                          "max_gram": 50
                      }
                    },
                    "analyzer": {
                        "a1" : {
                            "type"     :"custom",
                            "tokenizer":"email_tokenizer",
                            "filter"   : ["mynGram"]
                        }
                    },
                    "tokenizer" : {
        "email_tokenizer" : {
                            "type" : 'pattern',
               "pattern" => "[@\/:\.\\w\\p{L}]+"
}
                     }
                  }
             }
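
Applying such settings at index-creation time might look like the following minimal sketch (not from the original reply; "test" is a hypothetical index name and Elasticsearch is assumed on localhost:9200):

    import json
    import urllib2

    # ESINDEX is the settings dict defined above. Analysis settings can
    # only be supplied when the index is created (or while it is closed),
    # so PUT them together with the new index.
    req = urllib2.Request("http://localhost:9200/test",
                          json.dumps({"settings": ESINDEX}))
    req.get_method = lambda: "PUT"
    print(urllib2.urlopen(req).read())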




Re: Special Characters not indexed and hence not searchable

Praveen Kariyanahalli
This finally worked (see below). I had to negate the set of characters that I want in my tokens. It turns out the pattern defines the delimiter: the tokenizer splits the input wherever the pattern matches. So in my case the pattern says: keep consuming characters until you hit one that is not @, :, /, ., !, =, -, a letter, or a digit, emit what you have as a token, and then apply the filter to those tokens. I reread the documentation and then it clicked: http://www.elasticsearch.org/guide/reference/index-modules/analysis/pattern-analyzer.html

                    "tokenizer" : {
                        "email_tokenizer" : {
                            "type" : "pattern",
                            "pattern" : "[^@:\/\.\!\=\-\\w\\p{L}\\d]+"
                        }
                    },
                    "analyzer": {
                        "a1" : {
                            "type" : "custom",
                            "tokenizer":"email_tokenizer",
                            "filter"   : ["mynGram"]
                        }
                    }
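
The split semantics are easy to reproduce outside Elasticsearch (an illustrative Python sketch, not from the original post; \w stands in for \p{L} and \d, which it covers in Python's re):

    import re

    # The pattern names the separators, so it is the negation of the
    # characters a token may contain; splitting on it yields the tokens.
    delimiter = re.compile(r"[^@:/.!=\w-]+", re.UNICODE)
    print(delimiter.split("ping user@example.com at http://host:8080/x"))
    # ['ping', 'user@example.com', 'at', 'http://host:8080/x']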


Re: Special Characters not indexed and hence not searchable

Anusha
In reply to this post by Praveen Kariyanahalli
I am using the pattern "[^@:\/\.\!\=\-\\w\\p{L}\\d]+" in Sense for my settings, and it shows a "Bad string syntax" error. Do I need to change anything for Sense to accept the string? I would also like to add special characters like '-', '/', '(' and ')' to my pattern.


Here are my settings (the "pattern" line is where Sense reports the Bad string syntax error):
 "analysis": {
            "analyzer": {
             "my_analyzer":
             {
                 "type":"custom",
                 "tokenizer":"special_tokenizer",
                 "filter"   : ["mynGram"]
             }
            },
            "tokenizer": {
                "special_tokenizer":
                {
                    "type" : "pattern",
                    "pattern" : "[^-\/\\w\\p{L}\\d]+"                                             Here am getting Bad String syntax error in sense , any other way of giving the string
                }
            },
            "filter": {
                      "mynGram" : {
                          "type"    : "nGram",
                          "min_gram": 1,
                          "max_gram": 50
                      }
                    }       
                   
        }
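
A plausible cause, offered here as a hedged note rather than a confirmed answer: Sense parses the request body as strict JSON, and strict JSON only permits the escape sequences \" \\ \/ \b \f \n \r \t and \uXXXX, so sequences such as \. \! \= and \- in the first pattern above are rejected as bad string syntax (the Python dict literals earlier in the thread tolerate them because Python leaves unrecognized escapes intact). Dropping the unnecessary escapes and keeping the doubled backslashes gives a JSON-safe equivalent that also admits '(', ')', '-' and '/':

    "tokenizer": {
        "special_tokenizer": {
            "type"    : "pattern",
            "pattern" : "[^-/()\\w\\p{L}\\d]+"
        }
    }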

