Question on stemming + synonyms and tokenizerFactory
I have an analysis chain like this for some Spanish text:
standard asciifolding lowercase es_stop_filter es_stem_filter es_synonyms
With synonyms at the end, after all the other filters, I have to define my synonyms in their stemmed, ASCII-folded, lowercase forms. So instead of defining a synonym set like "vacuna, vacunación, inmunización", I have to define it as "vacun, vacunacion, inmunizacion".
In the case of a very aggressive stemmer like Snowball for English, we would have to define "intern, global" as a synonym mapping when we'd really want to write "international, global".
This is a little counter-intuitive for the folks who define our synonyms, as they think in dictionary terms and not stemmed tokens. To see which tokens to specify in the synonyms file, they need access to a "standard asciifolding lowercase es_stop_filter es_stem_filter" analysis chain, that is, everything but the synonym filter.
In this blog post about Solr, the author mentions that one could define a "custom tokenizer that returns the stemmed form of words from the synonyms file" to get around this. Is it possible to configure Elasticsearch this way?
Re: Question on stemming + synonyms and tokenizerFactory
Once you have your mapping set up, create an application that constructs the analyzer you need itself. Then feed it your real words and let it generate the stemmed versions.
I don't think that ES can be told to do this; but it provides the classes you need to do it yourself.
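To make that concrete, here is a rough sketch of such a preprocessing step in Python. It is not the Lucene analyzer itself: `fold_and_lower` only approximates the asciifolding and lowercase filters, the stop filter is skipped, and `toy_stem` with its `TOY_STEMS` table is a made-up stand-in for the real Spanish stemmer (it only knows the one stem quoted in the question). In a real tool you would swap in whatever stemmer your index actually uses.

```python
import unicodedata


def fold_and_lower(token):
    """Approximate asciifolding + lowercase: strip accents, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Hypothetical stand-in for the real stemmer: a tiny lookup table that
# only covers the example from the question.
TOY_STEMS = {"vacuna": "vacun"}


def toy_stem(token):
    """Return the stem from the toy table, or the token unchanged."""
    return TOY_STEMS.get(token, token)


def analyze_synonym_line(line, stem=toy_stem):
    """Turn a dictionary-form synonym line into its analyzed form."""
    terms = [t.strip() for t in line.split(",") if t.strip()]
    analyzed = []
    for term in terms:
        # Analyze each word of a (possibly multi-word) synonym separately.
        tokens = [stem(fold_and_lower(tok)) for tok in term.split()]
        out = " ".join(tokens)
        if out not in analyzed:  # drop terms that collapse to the same form
            analyzed.append(out)
    return ", ".join(analyzed)


print(analyze_synonym_line("vacuna, vacunación, inmunización"))
```

The people maintaining the synonyms keep writing dictionary forms, and the generated file is what gets handed to the synonym filter.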
For my own synonym processing, I do a Very Bad Thing. I create a synonym _type and then each document contains a list of words or phrases that are synonyms of each other. For a synonym query, I first query my synonym type. Then I OR the queries for each of the matching synonym words or phrases.
This is also much easier to maintain: I can update the synonyms on the fly and do not need to reindex the data at all. Not at all.
But it requires additional code, and it works best using the Java API. And some folks have indicated there are serious performance issues making this a Bad Solution. But I have not seen any problems with performance.
Oh, and all my words and phrases can be fully spelled out; they only get analyzed (tokenized, stemmed, and whatever else) when they are used in the subsequent query.
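Roughly, the two steps look like the sketch below. It is in Python rather than the Java API, and the first step is faked: `SYNONYM_SETS` and `lookup_synonyms` are hypothetical in-memory stand-ins for the query against the synonym type. The second step builds a plain `bool` query with one `match_phrase` clause per synonym, so each fully spelled-out word or phrase is analyzed by the field's own analyzer at query time. The field name `body` is an assumption.

```python
# Hypothetical in-memory stand-in for the documents of the synonym type:
# each entry is one list of words or phrases that are synonyms of each other.
SYNONYM_SETS = [
    ["vacuna", "vacunación", "inmunización"],
    ["international", "global"],
]


def lookup_synonyms(term):
    """Step 1: find the synonym set containing the term (case-insensitive)."""
    wanted = term.casefold()
    for group in SYNONYM_SETS:
        if any(wanted == member.casefold() for member in group):
            return list(group)
    return [term]  # no synonyms known: search for the term alone


def build_or_query(field, term):
    """Step 2: OR together one match_phrase clause per synonym."""
    clauses = [{"match_phrase": {field: s}} for s in lookup_synonyms(term)]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


query = build_or_query("body", "global")
print(query)
```

The resulting dict can be sent as the body of an ordinary search request. Because the synonyms live in data rather than in the analyzer, changing them only means updating that data.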