Changing tokenizer from whitespace to standard


Changing tokenizer from whitespace to standard

Andy Bajka-2
It looks like the "whitespace" tokenizer doesn't work very well for my forum searches, as we often search for alphanumeric words. When I search, for example, for:

test12345678

I get back thousands of results when I should get back only one.

I assume that changing "whitespace" to "standard" will correct the problem. Here is the relevant portion of my analyzer settings:


    "settings" : {
        "index" : {
            "number_of_shards" : 5,
            "number_of_replicas" : 0
        },
        "analysis" : {
            "filter" : {
                "tweet_filter" : {
                    "type" : "word_delimiter",
                    "type_table": ["( => ALPHA", ") => ALPHA"]
                }
            },
            "analyzer" : {
                "tweet_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase", "tweet_filter"]
                }
            }
        }
    },
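Note that the tokenizer may not be the culprit here: the word_delimiter filter in tweet_filter splits tokens at letter/digit boundaries by default, so test12345678 would be indexed as the two tokens test and 12345678, and any document containing test then matches. A rough local illustration of that letter/digit split (the sed expression only mimics the filter's default rule; it is not how Elasticsearch runs it):

```shell
# mimic word_delimiter's default letter-to-digit split on the sample search term
echo 'test12345678' | sed -E 's/([a-zA-Z])([0-9])/\1 \2/g'
```

If that split is the cause, disabling split_on_numerics in the word_delimiter filter, rather than changing the tokenizer, would keep test12345678 as a single token.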

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Changing tokenizer from whitespace to standard

Andy Bajka-2
I changed it from whitespace to standard and re-indexed; unfortunately, that didn't help.

I'm going to go back to whitespace and, for now, only allow alphabetic characters to be searched, with the exception of parentheses.

Hopefully someone with expertise will have a better solution.


Re: Changing tokenizer from whitespace to standard

Alexander Reelsen-2
Hey,

can you show two sample documents (one that is returned correctly, one that is not) as well as your query, so we can debug your problem?

Also, you should check out the analyze API, which lets you see how strings are tokenized:

curl 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'test12345678'
curl 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'test12345678'

The two outputs are identical, so it is clear why your change did not have any effect. See more at

Also, you might want to install the excellent inquisitor plugin, which gives you a nice web GUI for analyzing text; see


--Alex



On Tue, Apr 16, 2013 at 7:27 PM, Andy Bajka <[hidden email]> wrote:
I changed it from whitespace to standard and re-indexed, unfortunately that didn't help.

I'm going to go back to whitespace and for now only allow alpha characters to be searched with the exception of parenthesis.

Hopefully someone with expertise will have a better solution.


Re: Changing tokenizer from whitespace to standard

vallabh
Hi Everyone,

I am using the analysis-phonetic plugin for searching, which is based on Lucene.

I have artist names stored as ke$ha, !!! (chk chk chk), Jay-Z, and so on.

I am escaping special characters such as the exclamation mark, because it breaks the query, and I use a synonym filter to match ke$ha and !!! (chk chk chk), with "tokenizer" : "whitespace".
With this setup, when I search for kesha (without the $) I get the expected result, ke$ha,
and when I search for !!! (three exclamation marks) I get !!! (chk chk chk).

The problem is that for Jay-Z, I want to be able to search for jay z (without the hyphen, with a space in between), and that only works when I set "tokenizer" : "standard".

But when I set "tokenizer" : "standard", the kesha and exclamation-mark searches no longer work.

I want the behaviour of both tokenizers together.
I think this is possible with a custom analyzer, but I have been unable to build it, as I am new to Elasticsearch.
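One possible way to get both behaviours from a single analyzer (a sketch only; the artist_delimiter filter name and type_table entries are assumptions, not tested against this index) is to keep the whitespace tokenizer and add a word_delimiter filter: it splits Jay-Z into jay and z, while the type_table entries stop $ and ! from being stripped:

```json
"analysis" : {
    "filter" : {
        "artist_delimiter" : {
            "type" : "word_delimiter",
            "type_table" : ["$ => ALPHA", "! => ALPHA"],
            "preserve_original" : true
        }
    },
    "analyzer" : {
        "artist_analyzer" : {
            "tokenizer" : "whitespace",
            "filter" : ["artist_delimiter", "lowercase", "synonym", "artist_metaphone", "asciifolding"]
        }
    }
}
```

The existing synonym and artist_metaphone filter definitions would stay as they are; preserve_original keeps the unsplit token as well, so exact forms like jay-z remain searchable.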

I have created two files:
1. process.sh - where I do the indexing

echo 'Delete the index.'
curl -X DELETE 'http://localhost:9200/admin/?pretty=true'
 
echo; echo
echo 'Create the index.'

curl -X PUT 'http://localhost:9200/admin/?pretty=true' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "artist_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["standard", "lowercase", "synonym", "artist_metaphone", "asciifolding"]
                }
            },
            "filter" : {
                "artist_metaphone" : {
                    "type" : "phonetic",
                    "encoder" : "metaphone",
                    "replace" : false
                },
                "synonym" : {
                    "type" : "synonym",
                    "synonyms_path" : "/var/www/html/elasticsearch-master/synonyms.txt"
                }
            }
        }
    }
}
'

echo; echo
echo 'Create the mapping.'
curl -X PUT 'http://localhost:9200/admin/jos_artist_details/_mapping?pretty=true' -d '
{
  "jos_artist_details" : {
    "properties" : {    
      "name" : {
      "type": "string",
      "index_analyzer": "artist_analyzer",
      "search_analyzer": "artist_analyzer"
      }
   
    }
  }
}
'
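After the index is created, the analyzer can be checked directly with the analyze API, as a sanity test of how names are tokenized (a sketch; it assumes Elasticsearch is running on localhost:9200 as in the script above):

```shell
curl 'http://localhost:9200/admin/_analyze?analyzer=artist_analyzer&pretty' -d 'Jay-Z'
curl 'http://localhost:9200/admin/_analyze?analyzer=artist_analyzer&pretty' -d 'ke$ha'
```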

2. artist_display.php - where I search and display the data

$es = Client::connection(array(
        'servers' => '127.0.0.1:9200',
        'protocol' => 'http',
        'index' => 'admin',
        'type' => 'jos_artist_details'
));

$result = $es->search(array(
    "query" => array(
        "dis_max" => array(
            "queries" => array(
                0 => array(
                    "field" => array(
                        "name" => $search
                    )
                )
            )
        )
    ),
    "from" => 0,
    "size" => 100000
));

$total = $result['hits']['total'];
$data = $result['hits']['hits'];

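As an aside, the field query used in the script is an older shorthand for a query_string query on one field, which is why unescaped characters such as ! can break it; a match query does not parse any query syntax. A sketch of an equivalent request body in raw query DSL (field name taken from the mapping above):

```json
{
    "query" : {
        "match" : {
            "name" : "jay z"
        }
    },
    "from" : 0,
    "size" : 100000
}
```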
Any help is very much appreciated.
Thanks,