Removing whitespace around a delimiter in a custom analyzer

Removing whitespace around a delimiter in a custom analyzer

Rick Thomas
I'm having difficulty with a custom analyzer. I have a field in my
index that looks like this: [I am a token, I'm a token too, Tokenize
me,This is a token,Tokenize me]

Instead of creating term facets based on spaces, I want to create term
facets based on commas.  I also need to remove any whitespace around
the comma.   Here is my analyzer:

"analysis":{"analyzer":{"comma":{"type":"pattern","pattern":"\\s*,\\s*"}}}
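As a side note, the pattern on its own can be sanity-checked outside Elasticsearch; Python's `re.split` handles this particular pattern the same way Java's regex engine does (a sketch only, since the pattern analyzer also lowercases by default):

```python
import re

# The same pattern the analyzer uses: a comma with optional whitespace on either side.
pattern = r"\s*,\s*"

text = "I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me"
tokens = re.split(pattern, text)
print(tokens)
# → ['I am a token', "I'm a token too", 'Tokenize me', 'This is a token', 'Tokenize me']
```

As this shows, the regex does consume whitespace adjacent to each comma when splitting.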

It works in that it tokenizes the string based on commas, but it is
including trailing and leading whitespace in the tokens. I need to
get rid of that whitespace. The regex I'm using is supposed to do
that (it should pick up 0 to n spaces on either side of the comma),
but it is not doing that.

Any thoughts on how I can force the regex engine to be greedier in its
analysis?

Thanks,

Rick

Re: Removing whitespace around a delimiter in a custom analyzer

Karussell
Probably have a look at the word_delimiter filter.

Peter.


Re: Removing whitespace around a delimiter in a custom analyzer

Rick Thomas
That appears to do the opposite of what I need.


Re: Removing whitespace around a delimiter in a custom analyzer

Karussell
> That appears to do the opposite of what I need.

I think you can hack this word_delimiter thing a lot, e.g. by overriding
the comma char to be recognized as SUBWORD_DELIM (see type_table).


> I have a field in my index that looks like this

do you mean the field or the original data?


> Any thoughts on how I can force the regex engine to be greedier in its analysis?

No idea, I'm avoiding regex when and where I can :)

so I would do this via a custom WhitespaceTokenizer which overrides
isTokenChar

Peter.
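The idea Peter sketches here, treating the comma as the only delimiter and trimming what is left, can be illustrated in a few lines of Python (an illustration only; the real thing would be a Lucene CharTokenizer subclass in Java, and `comma_tokenize` is a hypothetical name):

```python
def comma_tokenize(text: str) -> list[str]:
    """Split on commas only, trim surrounding whitespace, and drop empties.

    This mirrors what a character tokenizer that treats ',' as the sole
    delimiter would emit, followed by a trim step.
    """
    return [part.strip() for part in text.split(",") if part.strip()]

print(comma_tokenize("foo bar  , ding bar   "))
# → ['foo bar', 'ding bar']
```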

Re: Removing whitespace around a delimiter in a custom analyzer

Rick Thomas
The original data looks like this:
I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me

Your guidance is very much appreciated.

I tried this WordDelimiter filter as part of a custom analyzer, and
all it did was tokenize based on whitespace.  Is there more
information on how to use the type_table field?  What tokenizer should
a custom analyzer that specifies a filter use?

"filter":{"comma_delimiter":{"type":"word_delimiter","type_table":[", => SUBWORD_DELIM"]}}

I feel like the solution should be easier than we're making it.



Re: Removing whitespace around a delimiter in a custom analyzer

Clinton Gormley-2
On Wed, 2012-02-08 at 09:38 -0800, Rick Thomas wrote:
> The original data looks like this:
> I am a token, I'm a token too, Tokenize me,This is a token,Tokenize me


Your original example works for me:

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "comma" : {
               "pattern" : "\\s*,\\s*",
               "type" : "pattern"
            }
         }
      }
   }
}
'


curl -XGET 'http://127.0.0.1:9200/foo/_analyze?pretty=1&text=I+am+a+token%2C+I'm+a+token+too%2C+Tokenize+me%2CThis+is+a+token%2CTokenize+me&analyzer=comma' 

# [Wed Feb  8 18:54:41 2012] Response:
# {
#    "tokens" : [
#       {
#          "end_offset" : 12,
#          "position" : 1,
#          "start_offset" : 0,
#          "type" : "word",
#          "token" : "i am a token"
#       },
#       {
#          "end_offset" : 29,
#          "position" : 2,
#          "start_offset" : 14,
#          "type" : "word",
#          "token" : "i'm a token too"
#       },
#       {
#          "end_offset" : 42,
#          "position" : 3,
#          "start_offset" : 31,
#          "type" : "word",
#          "token" : "tokenize me"
#       },
#       {
#          "end_offset" : 58,
#          "position" : 4,
#          "start_offset" : 43,
#          "type" : "word",
#          "token" : "this is a token"
#       },
#       {
#          "end_offset" : 70,
#          "position" : 5,
#          "start_offset" : 59,
#          "type" : "word",
#          "token" : "tokenize me"
#       }
#    ]
# }


Perhaps you need to give a working example (as above) showing exactly
what you are doing, the results you are getting, and what is wrong with
those results.

clint



Re: Removing whitespace around a delimiter in a custom analyzer

Rick Thomas
Everything seems to work fine with the basic analysis, but when you
introduce faceting, you get extra whitespace around the tokens that
you don't get when you call _analyze.  To use real world data:

item 1: category: "foo bar  "
item 2: category: "foo bar, ding bar   "

This will create 3 distinct tokens: "foo bar", "foo bar  ", "ding bar   "

In reality, I need 2 distinct tokens: "foo bar" and "ding bar".

I've put together a gist to recreate the issue:

https://gist.github.com/1773423

Any help getting rid of the whitespace around the tokens would be much
appreciated.
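The symptom is reproducible with plain regex splitting: `\s*,\s*` only consumes whitespace adjacent to a comma, so whitespace at the ends of the whole value survives (a Python sketch; the pattern analyzer additionally lowercases):

```python
import re

pattern = r"\s*,\s*"

# No comma at all: nothing matches, so the trailing spaces survive intact.
no_comma = re.split(pattern, "foo bar  ")
print(no_comma)    # ['foo bar  ']

# Whitespace around the comma is consumed, but the trailing run is not.
with_comma = re.split(pattern, "foo bar, ding bar   ")
print(with_comma)  # ['foo bar', 'ding bar   ']
```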


Re: Removing whitespace around a delimiter in a custom analyzer

Rick Thomas
Is there anything I can do to simplify recreating the problem for
those who are more knowledgeable?


Re: Removing whitespace around a delimiter in a custom analyzer

kimchy
Administrator
On my end: not a regex expert, so not sure why whitespace is not removed based on your regular expression. It requires some playing to get it done properly. One thing we can do is add an analyzer token filter that can trim whitespace; that might make things simpler...



Re: Removing whitespace around a delimiter in a custom analyzer

Clinton Gormley-2

> I've put together a gist to recreate the issue:
>
> https://gist.github.com/1773423
>
> Any help getting rid of the whitespace around the tokens would be much
> appreciated.

OK, so the comma analyzer is actually removing whitespace around the
comma. The problem is that you have whitespace at the beginning or end
of your strings, where no commas are involved; that's where the
whitespace is coming from.

This works:

curl -XPUT 'http://127.0.0.1:9200/foo/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "comma" : {
               "pattern" : "^\\s+|\\s*,\\s*|\\s+$",
               "type" : "pattern"
            }
         }
      }
   }
}
'

clint
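The extended pattern can be checked the same way with Python's `re.split` (a sketch; note that `re.split` emits empty strings where the pattern matches at the string edges, which the pattern analyzer simply discards):

```python
import re

# Leading whitespace | whitespace-wrapped comma | trailing whitespace
fixed = r"^\s+|\s*,\s*|\s+$"

raw = "  foo bar, ding bar   "
tokens = [t for t in re.split(fixed, raw) if t]  # drop the empty edge pieces
print(tokens)
# → ['foo bar', 'ding bar']
```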


Re: Removing whitespace around a delimiter in a custom analyzer

Rick Thomas
Thanks so much for the assistance!
