Question on stemming + synonyms and tokenizerFactory

Question on stemming + synonyms and tokenizerFactory

Loren
I have an analysis chain like this for some Spanish text:
standard asciifolding lowercase es_stop_filter es_stem_filter es_synonyms

With synonyms at the end, after all the other filters, I have to define my synonyms in their stemmed, ASCII-folded, lowercase forms. So instead of defining a synonym set like "vacuna, vacunación, inmunización", I have to define it as "vacun, vacunacion, inmunizacion".
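For concreteness, a chain like the one above might correspond to index settings roughly like the following, written here as a Python dict. Only the filter names, their order, and the stemmed synonym line come from this post. The filter definitions (the `stop` and `stemmer` types and their parameters), the reading of "standard" as the tokenizer, and the analyzer name `es_text` are assumptions made for illustration, and the exact stems depend on which stemmer is actually configured.

```python
import json

# Hypothetical index settings. Filter names and order are from the post;
# every filter definition below is an assumption, not the poster's config.
settings = {
    "analysis": {
        "filter": {
            "es_stop_filter": {"type": "stop", "stopwords": "_spanish_"},
            "es_stem_filter": {"type": "stemmer", "language": "light_spanish"},
            "es_synonyms": {
                "type": "synonym",
                # Because this filter runs last, its terms must already be in
                # the form the earlier filters produce (as quoted in the post).
                "synonyms": ["vacun, vacunacion, inmunizacion"],
            },
        },
        "analyzer": {
            "es_text": {  # made-up analyzer name
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "asciifolding",
                    "lowercase",
                    "es_stop_filter",
                    "es_stem_filter",
                    "es_synonyms",
                ],
            }
        },
    }
}

print(json.dumps(settings, indent=2))
```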

In the case of a very aggressive stemmer like Snowball for English, we would have to define "intern, global" as a synonym mapping when we'd really want to write "international, global". 

This is a little counter-intuitive for the folks who define our synonyms, because they think in dictionary terms, not stemmed tokens. To see which tokens to put in the synonyms file, they need access to a "standard asciifolding lowercase es_stop_filter es_stem_filter" analysis chain, that is, everything but the synonym filter.

In this blog post about Solr, the author mentions that one could define a "custom tokenizer that returns the stemmed form of words from the synonyms file" to get around this. Is it possible to configure Elasticsearch this way?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a7009182-9577-4580-872a-1b121be3457d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Question on stemming + synonyms and tokenizerFactory

InquiringMind
Once you have your mapping set up, create an application that itself constructs the analyzer you need. Then feed it your real words and let it generate the stemmed versions.

I don't think ES can be told to do this, but it provides the classes you need to do it yourself.
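A minimal sketch of that preprocessing step, in Python rather than against the ES classes: it takes synonym lines written in dictionary form and rewrites each entry through an analyzer function. The `analyze` callable is a hypothetical hook. In a real setup it would have to apply the same chain as the index (everything but the synonym filter), for instance by calling Elasticsearch's `_analyze` API. The `toy_analyze` function here only ASCII-folds and lowercases, so it does not stem or remove stop words.

```python
import unicodedata
from typing import Callable, List

# Hypothetical hook: maps a word or phrase to its analyzed tokens.
Analyzer = Callable[[str], List[str]]


def normalize_entry(entry: str, analyze: Analyzer) -> str:
    """Run one synonym entry (word or phrase) through the analyzer."""
    return " ".join(analyze(entry.strip()))


def normalize_line(line: str, analyze: Analyzer) -> str:
    """Rewrite one synonyms-file line, handling "a, b" and "a, b => c" rules."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return stripped  # keep blank lines and comments unchanged
    sides = []
    for side in stripped.split("=>"):
        entries = [normalize_entry(e, analyze) for e in side.split(",")]
        # Entries that analyze to nothing (e.g. pure stop words) are dropped.
        sides.append(", ".join(e for e in entries if e))
    return " => ".join(sides)


def toy_analyze(text: str) -> List[str]:
    """Stand-in analyzer: ASCII-fold and lowercase only, no stemming."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = decomposed.encode("ascii", "ignore").decode("ascii")
    return folded.lower().split()


print(normalize_line("Vacuna, vacunación, inmunización", toy_analyze))
# prints: vacuna, vacunacion, inmunizacion
```

With the real chain plugged in as `analyze`, the people maintaining the synonyms could keep writing dictionary forms and the stemmed file would be generated from them.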

For my own synonym processing, I do a Very Bad Thing. I create a synonym _type in which each document contains a list of words or phrases that are synonyms of each other. For a synonym query, I first query my synonym type. Then I OR the queries for each of the matching synonym words or phrases.

This is also much easier to maintain: I can update the synonyms on the fly and do not need to reindex the data at all. Not at all.

But it requires additional code, and it works best using the Java API. And some folks have indicated there are serious performance issues making this a Bad Solution. But I have not seen any problems with performance.

Oh, and all my words and phrases can be fully spelled out; only when they are used in the subsequent query do they get analyzed (tokenized, stemmed, and whatever else).
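Roughly, the two-step approach looks like the sketch below, written in Python rather than the Java API. The in-memory `SYNONYM_GROUPS` list is a hypothetical stand-in for the synonym type (in practice the first step is itself a query), and the field name `body` is made up. The second step ORs one `match_phrase` clause per synonym inside a `bool` query, so each word or phrase is analyzed by the target field's analyzer at query time.

```python
from typing import Dict, List

# Hypothetical stand-in for the synonym type: one list per synonym document.
SYNONYM_GROUPS: List[List[str]] = [
    ["vacuna", "vacunación", "inmunización"],
    ["international", "global"],
]


def lookup_synonyms(term: str, groups: List[List[str]]) -> List[str]:
    """Return all members of every group containing term, else just the term."""
    wanted = term.casefold()
    found: List[str] = []
    for group in groups:
        if any(wanted == member.casefold() for member in group):
            for member in group:
                if member not in found:
                    found.append(member)
    return found or [term]


def build_query(term: str, field: str, groups: List[List[str]]) -> Dict:
    """Build a bool query that ORs a match_phrase clause for each synonym."""
    clauses = [{"match_phrase": {field: s}} for s in lookup_synonyms(term, groups)]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


query = build_query("global", "body", SYNONYM_GROUPS)
print(query)
```

Because the synonym list is only read at query time, editing it changes the expansion on the next search without touching the indexed documents.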

Brian
