excluding punctuation from fields

excluding punctuation from fields

Michael Sick
Hi All,

What's the best way (and what are the tradeoffs) to exclude punctuation (or specific characters) from certain fields during analysis and searching?

For example, a document containing "P.F. Changs" should match a search for either "P.F. Changs" or "PF Changs".

It looks like I could do this with the Synonym filter and the ICU plugin. Are there better options? Which is best or what are the tradeoffs?

Overall, if anyone knows of any resources that compare/contrast the various analyzers/filters/..., it would be very helpful.

Thanks for any advice/pointers,

--Mike

Re: excluding punctuation from fields

Ivan Brusic
The standard filter should remove punctuation from tokens.

You can use the analysis API to view the differences between analyzers
(and unfortunately not between tokenizers or filters). The Lucene in
Action book has a summary of the different classes.
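
For reference, a minimal sketch of how such a comparison could be scripted. It assumes a recent Elasticsearch release whose `_analyze` endpoint accepts a JSON body with `analyzer` and `text` keys; older releases, such as those current when this thread was written, may expect a different request shape. The node address `http://localhost:9200` and the helper names `build_analyze_request` and `token_strings` are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard", base_url="http://localhost:9200"):
    """Build (but do not send) a POST request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response_body):
    """Pull the token strings out of an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


if __name__ == "__main__":
    # Compare how several built-in analyzers would be asked to handle the same text.
    for name in ("standard", "whitespace", "simple"):
        req = build_analyze_request("P.F. Changs", analyzer=name)
        # Sending requires a running node (assumed, not provided here):
        # with urllib.request.urlopen(req) as resp:
        #     print(name, token_strings(resp.read()))
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Running the same text through each analyzer and comparing the token lists side by side is usually the quickest way to see where punctuation survives.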

--
Ivan

In reply to Michael Sick's post of Sat, May 19, 2012 at 8:14 AM.

Re: excluding punctuation from fields

Michael Sick
Ivan,

Thanks! The Analysis API is priceless. Thanks,

--Mike

In reply to Ivan Brusic's post of Sun, May 20, 2012 at 4:35 PM.

Re: excluding punctuation from fields

Michael Sick
In reply to this post by Ivan Brusic
I'm still having no luck with this. I've created a more self-contained example of the behavior. In short, I'm storing a document with a field containing:

   "P.F. Changs Burgers"

Create & Run Test: https://gist.github.com/2792582


I'd like ES to return a match if I search on "p.f.", "p.f", "pf.", or "pf", regardless of case. Currently only the first two work. My approach relies on using the Synonym filter to translate all of the forms above to "pf". I'd be happy to fix this approach or, even better, to learn that there's a general approach that doesn't require as much configuration.

Thanks! --Mike
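
One possible explanation for why only the first two forms match, sketched under loud assumptions: the gist is not reproduced here, so this assumes its analyzer uses the standard tokenizer followed by lowercasing, and it approximates that tokenizer with a regex that keeps a dot between two alphanumeric runs and drops a trailing dot. The names `TOKEN`, `analyze`, and `matches` are hypothetical.

```python
import re

# Rough stand-in for the standard tokenizer (an approximation, not the real rules):
# alphanumeric runs, optionally joined by a single interior ".".
TOKEN = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def analyze(text):
    """Tokenize and lowercase; interior dots are kept, trailing dots are dropped."""
    return [tok.lower() for tok in TOKEN.findall(text)]


def matches(query, document):
    """True if every token of the query appears among the document's tokens."""
    doc_tokens = set(analyze(document))
    return all(tok in doc_tokens for tok in analyze(query))


if __name__ == "__main__":
    doc = "P.F. Changs Burgers"
    print(analyze(doc))
    for query in ("p.f.", "p.f", "pf.", "pf"):
        print(repr(query), matches(query, doc))
```

Under these assumptions the indexed token is "p.f", so "p.f." and "p.f" find it while "pf." and "pf" do not. A synonym rule would then have to map "p.f" to "pf" for every such abbreviation, which is the configuration burden described above.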


Re: excluding punctuation from fields

Ivan Brusic
It just occurred to me as I was testing things that the standard filter
does NOT remove punctuation. My custom filter in Lucene was stripping
punctuation, not the standard filter.

I was able to remove punctuation by using a mapping char_filter. Mine
simply removes dots '.'

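index :
    analysis :
        analyzer :
            unstemmed :
                type : custom
                # Assumption: this line was not in the original post; a custom analyzer needs a tokenizer.
                tokenizer : standard
                filter : [unique, standard, asciifolding, lowercase]
                char_filter : [punctuation]
        char_filter :
            punctuation :
                type : mapping
                # Replace every "." with the empty string before tokenizing.
                mappings : [".=>"]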
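
A minimal sketch of what this pipeline does to the example text, assuming the order char_filter, then tokenizer, then lowercase. It is a simulation in plain Python, not Elasticsearch code: the tokenizer is approximated by a regex, the `unique` and `asciifolding` filters are left out, and the mapping rules are applied one after another rather than by longest match. The names `MAPPINGS`, `apply_char_filter`, and `analyze` are hypothetical.

```python
import re

# Same "source=>replacement" shape as the mappings list in the config.
MAPPINGS = [".=>"]

# Simplified tokenizer: runs of ASCII letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def apply_char_filter(text, mappings=MAPPINGS):
    """Rewrite the raw text before it reaches the tokenizer."""
    for rule in mappings:
        source, _, replacement = rule.partition("=>")
        text = text.replace(source, replacement)
    return text


def analyze(text):
    """char_filter -> tokenizer -> lowercase, in that order."""
    return [tok.lower() for tok in TOKEN.findall(apply_char_filter(text))]


if __name__ == "__main__":
    print(analyze("P.F. Changs Burgers"))
    for query in ("p.f.", "p.f", "pf.", "pf", "P.F. Changs", "PF Changs"):
        print(repr(query), analyze(query))
```

Because the dots are gone before tokenizing, all four query forms reduce to the single token "pf", provided the same analyzer is used at index and search time. The tradeoff is that every dot in the field is removed, so values such as "3.14" or "example.com" are also collapsed into one token.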


In reply to Michael Sick's post of Fri, May 25, 2012 at 11:41 PM, which listed two gists:

Create & Run Test: https://gist.github.com/2792582
Delete Artifacts: https://gist.github.com/2792590


Re: excluding punctuation from fields

Michael Sick
Thanks Ivan - I'll give that a shot.


In reply to Ivan Brusic's post of Wed, May 30, 2012 at 2:21 PM.