char_filter for German

char_filter for German

Krešimir Slugan
 Hi,

To handle German in search, I have to return the same results whether the user searches for e.g. über, uber or ueber.

I would do this at index time, where I would have über in the data. But if I use just the asciifolding filter, I lose the information that this was a word with an umlaut, and I can't get the ueber token. If I use a char_filter, it is applied before the rest of the analysis chain, and I would not be able to get uber.

Is it possible to preserve the original with a char filter, or to apply it after the analysis?
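
The ordering problem described above can be illustrated with a minimal Python sketch that runs outside Elasticsearch. The helpers `transliterate` and `ascii_fold` are hypothetical stand-ins for a mapping char filter and the asciifolding token filter; they only approximate those components for this example.

```python
import unicodedata

# Hypothetical stand-in for a mapping char_filter (ü => ue, ...).
UMLAUT_MAP = {"ä": "ae", "ö": "oe", "ü": "ue",
              "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def transliterate(text: str) -> str:
    """Rewrite umlauts to their two-letter spelling, as a char filter would."""
    return "".join(UMLAUT_MAP.get(ch, ch) for ch in text)


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "über"
print(ascii_fold(word))                 # folding alone: the umlaut information is gone
print(transliterate(word))              # char filter alone: only the "ue" spelling
print(ascii_fold(transliterate(word)))  # char filter first: nothing left to fold
```

Whichever step runs first decides which single variant survives, which is why one linear chain of this kind cannot emit both uber and ueber.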

Cheers,

Kresimir

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f18f94bc-58e0-4bbf-a445-b45ba4db11f3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: char_filter for German

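Itamar Syn-Hershko
You may find the approach I give at the end of this talk helpful: https://skillsmatter.com/skillscasts/4968-approaches-to-multi-lingual-text-search-with-elasticsearch-and-lucene
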
--

Itamar Syn-Hershko
http://code972.com | @synhershko
Freelance Developer & Consultant

Re: char_filter for German

"Jürgen Wagner (DVT)"
Hello Kresimir,
  as a native speaker of German and a linguist, I know you usually want to preserve the umlaut, but for searches you may want to relax the precision of matching. So why not do precisely this? If you have "über" or "ueber" in your query, replace it with "über OR ueber". And if you want to take care of those Americans who believe these two dots carry no meaning at all (heavy grin at this point), you may even add "OR uber". Syntactically, "uber" is wrong. This would only be a convenience rule for users who think they can simply omit the umlaut dots or who cannot type umlaut characters on their keyboards.

Note: when it comes to German last names, the names Ganser, Gänser and Gaenser would be considered three entirely different names, although the alternative spelling (e.g., in plain e-mail addresses) of Gänser could be Gaenser. Mapping umlauts will get you false positives.

Also be careful with the reverse. "ue", "oe" and "ae" cannot simply be spelled as "ü", "ö" or "ä". In a word like "Zooeingang" (zoo entrance), the composite is actually made of "Zoo" and "Eingang", so the "oe" must not be interpreted as "ö".
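
A small Python sketch of why the reverse direction is unsafe. The function `naive_reverse` is a hypothetical helper written only for this illustration; it blindly turns the lowercase digraphs back into umlauts.

```python
def naive_reverse(word: str) -> str:
    """Blindly rewrite ae/oe/ue as umlauts (lowercase digraphs only)."""
    for digraph, umlaut in (("ae", "ä"), ("oe", "ö"), ("ue", "ü")):
        word = word.replace(digraph, umlaut)
    return word


print(naive_reverse("ueber"))       # the intended case
print(naive_reverse("Zooeingang"))  # the compound Zoo + Eingang gets mangled
```

The second call corrupts the compound because the "oe" spans the boundary between "Zoo" and "Eingang"; a plain string replacement has no way of knowing that.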

Similar issues exist with "ß" and "ss".

Well, most likely these funny cases won't matter too much, so I suggest trying a simple disjunctive expansion for a start.
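
The disjunctive expansion could be sketched in Python roughly as follows. This is only an illustration for a single query term: `expand_term` and the two translation tables are hypothetical names, and the result is a plain "a OR b" string that would still have to be fed into whatever query syntax is in use.

```python
# Umlaut -> two-letter spelling (ü -> ue) and umlaut -> bare vowel (ü -> u).
TRANSLITERATE = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                               "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})
STRIP_DOTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u",
                            "Ä": "A", "Ö": "O", "Ü": "U", "ß": "ss"})


def expand_term(term: str) -> str:
    """Expand one query term into a disjunction of its spelling variants."""
    variants = [term, term.translate(TRANSLITERATE), term.translate(STRIP_DOTS)]
    unique = list(dict.fromkeys(variants))  # de-duplicate, keep order
    return " OR ".join(unique)


print(expand_term("über"))
print(expand_term("Haus"))
```

This sketch only expands from the umlaut form. A query typed as "ueber" is left unchanged, because mapping "ue" back to "ü" runs into the Zooeingang problem mentioned above.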

Best regards,
--Jürgen

--

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: [hidden email], URL: www.devoteam.de


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071


Re: char_filter for German

Krešimir Slugan
In reply to this post by Itamar Syn-Hershko
Hi Itamar,

I don't think this solves my problem. I'm aware that you can preserve the original with asciifolding, but since the char_filter is applied before asciifolding, there would not be any umlauts left to fold :) It would be OK if I could apply the char_filter at the end, or preserve the original with the char_filter.

Best,

Kresimir

On Saturday, November 29, 2014 5:41:11 PM UTC+1, Itamar Syn-Hershko wrote:
You may find the approach I give at the end of this talk helpful: https://skillsmatter.com/skillscasts/4968-approaches-to-multi-lingual-text-search-with-elasticsearch-and-lucene

Re: char_filter for German

Krešimir Slugan
In reply to this post by "Jürgen Wagner (DVT)"
 Hi Jürgen,
 
I'm aware that mapping umlauts produces many false positives, but we have noticed that some of our users omit them while searching. I guess we'll have to make a product decision there, because we cannot cover all use cases anyway.

Thanks for your response!

Best,

Kresimir


Re: char_filter for German

Itamar Syn-Hershko
In reply to this post by Krešimir Slugan
What I'm saying is: don't use char_filter; use the token filter chain to achieve that.
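
One possible reading of this suggestion, sketched as index settings built in Python. The analyzer name `german_folding` and the filter name `folding_keep_original` are made up for illustration. The `asciifolding` filter with `preserve_original` is described in the Elasticsearch documentation as far as I know, but the exact option names should be checked against the version in use.

```python
import json

# Assumed settings body for index creation; the custom names are hypothetical.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "folding_keep_original": {
                    "type": "asciifolding",
                    "preserve_original": True,  # keep the unfolded token as well
                }
            },
            "analyzer": {
                "german_folding": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "folding_keep_original"],
                }
            },
        }
    }
}

# JSON body that would be sent when creating the index.
print(json.dumps(index_settings, indent=2, ensure_ascii=False))
```

A chain like this covers the über and uber forms only. The ueber spelling from the original question would still need separate handling.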

Re: char_filter for German

Krešimir Slugan
Which token filter can I use to replace words like über with ueber?

Re: char_filter for German

Itamar Syn-Hershko
Why do you need it as ueber? What I usually do is end up with [über, uber] at the same position, possibly marking the first as the original. Seeing Jürgen's response, I seem to be on the right path...
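
The idea of two tokens at the same position can be simulated in plain Python. This is only a toy model of a token stream, not Lucene or Elasticsearch code: `Token`, `ascii_fold` and `fold_keep_original` are hypothetical names, and the order in which a real filter emits the two tokens may differ.

```python
import unicodedata
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    is_original: bool


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_keep_original(words: Iterable[str]) -> Iterator[Token]:
    """Emit each word, plus its folded form at the same position if it differs."""
    for position, word in enumerate(words):
        yield Token(word, position, is_original=True)
        folded = ascii_fold(word)
        if folded != word:
            yield Token(folded, position, is_original=False)


for token in fold_keep_original(["über", "den", "wolken"]):
    print(token)
```

Because both tokens share position 0, phrase and proximity matching would treat über and uber as the same slot in the text.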

Re: char_filter for German

Krešimir Slugan
Because, as far as I understand, in German it's semantically the same to write über or ueber (although ueber is used less often). I guess the only exception is personal names.
Orthographically, "uber" is wrong, but users sometimes search for it as well.

On Sat, Nov 29, 2014 at 8:29 PM, Itamar Syn-Hershko <[hidden email]> wrote:
Why do you need it as ueber? what I'm usually doing is end up with [über, uber] at the same position, possibly marking the first as being the original. Seeing Jurgen's response, I seem to be on the right path...


Re: char_filter for German

"Jürgen Wagner (DVT)"
Hi Krešimir,
  the correct term is "über" (over, above) or "hören" (hear) or "ändern" (change). When you cannot write umlauts, the correct alternative spelling in print is "ueber", "hoeren", "aendern". Everybody can write this in ASCII. However, those who are possibly non-speakers of German who still want to search for German terms are usually not aware of this and believe it's like with accents in French, where "á" is lexically treated like "a". Those users are wrong in spelling "uber", "horen", "andern" because "u" and "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE letter :-)

However, in order to provide a convenience to those users as well,  you could decide that - to yield at least some meaningful results - you will also consider the versions without the umlaut dots equivalent. In that case, you want to map any token containing an umlaut (ä, ö, ü) to three alternatives: umlaut, without umlaut marker, alternative spelling with 'e'. This won't let you distinguish between the "Bar" (bar, the place to get a drink) and "Bär" (bear, the one giving you a great, dangerous hug). "Forderung" (demand) and "Förderung" (encouragement, facilitation, promotion, extraction [geol.]) are also quite different, just to give a few examples.
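
To make the three alternatives concrete, here is a minimal Python sketch. It is not an Elasticsearch feature; the helper name and the two lookup tables are assumptions made up for illustration.

```python
from typing import List

# Hypothetical lookup tables, not part of Elasticsearch.
UMLAUT_TO_E_FORM = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
UMLAUT_TO_PLAIN = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U"}


def umlaut_variants(token: str) -> List[str]:
    """Return the token, its 'e' spelling, and its spelling without the dots."""
    e_form = "".join(UMLAUT_TO_E_FORM.get(ch, ch) for ch in token)
    plain = "".join(UMLAUT_TO_PLAIN.get(ch, ch) for ch in token)
    variants: List[str] = []
    for candidate in (token, e_form, plain):
        if candidate not in variants:  # a token without umlauts yields one entry
            variants.append(candidate)
    return variants


print(umlaut_variants("über"))   # ['über', 'ueber', 'uber']
print(umlaut_variants("Bär"))    # ['Bär', 'Baer', 'Bar']
print(umlaut_variants("clown"))  # ['clown']
```

The second call shows the collision mentioned above: once "Bar" is among the variants of "Bär", the index can no longer tell the two words apart.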

For the proper recognition of those terms, you would normally use a dictionary of German, including some frequent proper names as well. So, if you look for "clown boll", you would not only get "Der Clown im Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines Clowns", because the query would be transformed into "clown AND (boll OR boell OR böll)" as "boll" matches an umlaut candidate in your dictionary. If you dare to normalize your indexed texts, so "Boell" would already have been turned into "Böll", you could even do with a disjunction of only the one correct form and the misspelling. Again, however, you would make use of a dictionary to perform such normalization. Ideally, you would even have a POS tagger in place, so you would only make such replacements where the name Böll is referred to, not the city of Bad Boll.
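
The query transformation described above could look roughly like the following Python sketch. The tiny dictionary, the function names, and the AND/OR output syntax are assumptions for illustration; a real system would use a proper German lexicon.

```python
from typing import Dict

# Hypothetical mini-dictionary: maps a spelling without dots to the correct
# umlaut form. A real system would load this from a German lexicon.
UMLAUT_CANDIDATES: Dict[str, str] = {"boll": "böll", "uber": "über"}

E_FORMS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def expand_term(term: str) -> str:
    """Turn a lowercase term into a disjunction if the dictionary knows it."""
    umlaut_form = UMLAUT_CANDIDATES.get(term)
    if umlaut_form is None:
        return term
    e_form = "".join(E_FORMS.get(ch, ch) for ch in umlaut_form)
    return "(" + " OR ".join([term, e_form, umlaut_form]) + ")"


def expand_query(query: str) -> str:
    """Join all expanded terms with AND, as in the example above."""
    return " AND ".join(expand_term(t) for t in query.lower().split())


print(expand_query("clown boll"))  # clown AND (boll OR boell OR böll)
```

Terms that are not in the dictionary pass through unchanged, so only known umlaut candidates pay the cost of the extra alternatives.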

It's a question of how much effort makes sense for your application. If you just want to index some German text, maybe you just want to turn all umlauts into the plain vowels for the purpose of indexing, but still keep a reference to the original for result display. Maybe that's sufficient. For larger volumes of documents, a more precise approach is recommended to avoid false positives.

Cheers,
--Jürgen




--

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: [hidden email], URL: www.devoteam.de


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071



Re: char_filter for German

Krešimir Slugan
Hi Jürgen,

Currently we don't have big volumes of data to index, so we would rather return more results, in the hope that the relevant ones still show up at the top. In the future, when we have more data, we'll have to sacrifice some use cases in order to provide more precise results for the rest of the users.

I think I will try a regexp-based token approach that replaces umlauts with their "e" forms to solve this double expansion problem.

Best,

Krešimir


Re: char_filter for German

joergprante@gmail.com
Do not use regex, this will give wrong results.

Elasticsearch comes with full support for German umlaut handling.

If you install the ICU plugin, you can use something like this analysis setting:

{
    "index" : {
        "analysis" : {
            "filter" : {
                "german_normalize_stem" : {
                  "type" : "snowball",
                  "name" : "German2"
                }
            },
            "analyzer" : {
                "stemmed" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalize_stem"
                    ]
                },
                "unstemmed" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalize"
                    ]
                }
            }
        }
    }
}

ICU handles German umlauts, and also case folding like "ss" and "ß".

Snowball handles umlaut expansions (ae, oe, ue) at the right places in words.

You can choose between stemmed and unstemmed analysis. Snowball tends to overstem words. The "german_normalize" token filter is copied from Snowball but works without stemming.

The effect of the combination is that all German words like Jörg, Joerg, Jorg are reduced to jorg in the index.
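
To illustrate what "at the right places" can mean, here is a rough Python sketch of position-aware normalization. It is an approximation written for this thread, assuming rules along the lines of the German2 Snowball heuristics (fold "ae", "oe" and "ue", but leave "ue" alone after a vowel or "q"); it is not the actual Lucene or Snowball code, and edge cases may differ.

```python
def german_normalize_sketch(token: str) -> str:
    """Approximate umlaut normalization (illustrative, not the real filter)."""
    out = []
    # state: "N" = at the start or after a consonant,
    #        "V" = after a vowel or q,
    #        "U" = after a/o, or after a u that may begin an "e" spelling
    state = "N"
    for ch in token.lower():
        if ch in "ao":
            out.append(ch)
            state = "U"
        elif ch == "u":
            out.append(ch)
            state = "U" if state == "N" else "V"
        elif ch == "e":
            if state != "U":  # drop the e only when it completes ae/oe/ue
                out.append(ch)
            state = "V"
        elif ch in "iqy":
            out.append(ch)
            state = "V"
        elif ch in "äöü":
            out.append({"ä": "a", "ö": "o", "ü": "u"}[ch])
            state = "V"
        elif ch == "ß":
            out.append("ss")
            state = "N"
        else:
            out.append(ch)
            state = "N"
    return "".join(out)


for word in ("Jörg", "Joerg", "Jorg"):
    print(german_normalize_sketch(word))   # jorg (three times)

print(german_normalize_sketch("neue"))     # neue
print("neue".replace("ue", "u"))           # neu  (naive replacement breaks the word)
```

The last two lines show why a blind "ue" to "u" replacement gives wrong results: in "neue" the "ue" follows a vowel and is not an umlaut spelling at all.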

Best,

Jörg



Re: char_filter for German

Beatrix Willius
Hi,

I'm in the preliminary stages of implementing Elasticsearch, so I'm interested in this, too.

What about mixed languages, or cases where I don't even know the language? My data are emails, so the text could be in any language.


Mit freundlichen Grüßen/Regards

Trixi Willius

http://www.mothsoftware.com
Mail Archiver X: The email archiving solution for professionals


Re: char_filter for German

joergprante@gmail.com
Hi,

By using my langdetect plugin and the analyzer-by-path selection of Elasticsearch, it is possible to analyze input by detected language.

See



With "copy_to", the many analyzed fields could be merged together into a general field (similar to _all) for convenient search.
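
As a rough illustration of the "copy_to" idea, the mapping below is built as a Python dict and printed as JSON. The field names are hypothetical, and "text" is the field type in current Elasticsearch releases (versions from the time of this thread used "string"). How a document ends up in the right language field, for example via language detection, is outside this sketch.

```python
import json

# Hypothetical field names; only "analyzer" and "copy_to" are the mapping
# parameters being illustrated here.
mapping = {
    "properties": {
        "body_de": {"type": "text", "analyzer": "german", "copy_to": "body_all"},
        "body_en": {"type": "text", "analyzer": "english", "copy_to": "body_all"},
        # General catch-all field, similar in spirit to _all.
        "body_all": {"type": "text"},
    }
}

print(json.dumps(mapping, indent=2))
```

A search can then target "body_all" when the query language is unknown, or one of the language-specific fields when it is known.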

Jörg


Re: char_filter for German

Andrej
In reply to this post by joergprante@gmail.com
Hello Jörg,

could you maybe share the configuration for the german_normalize analyzer without stemming? I actually only need the umlaut expansion. And what do you mean by "at the right places in words" for snowball?

Thanks!
Andrej


Re: char_filter for German

Krešimir Slugan
In reply to this post by joergprante@gmail.com

Where does this "german_normalize" filter come from? It solves my problem completely, almost magically, but it's not documented anywhere (and it doesn't seem to be part of the ICU plugin either).

What is also odd is that the filter cannot be used in the global context, e.g. this does not work:
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'

but it does work in an index context:
curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'


In the first case I get "ElasticsearchIllegalArgumentException[failed to find global token filter under [german_normalize]]"
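
For reference, the two requests above differ only in whether an index name precedes `_analyze`. Below is a minimal Python sketch that builds both URLs without sending anything. `analyze_url` is a hypothetical helper; the host, index name, and query parameters are simply the ones from the curl commands in this post.

```python
from typing import List, Optional
from urllib.parse import urlencode


def analyze_url(host: str, tokenizer: str, filters: List[str],
                index: Optional[str] = None) -> str:
    """Build an _analyze URL in the query-parameter style used in this thread."""
    # With an index name the request runs in that index's context, where the
    # filter was found; without one it goes to the global endpoint, which
    # reported "failed to find global token filter".
    path = f"/{index}/_analyze" if index else "/_analyze"
    # safe="," keeps the comma-separated filter list readable in the URL.
    query = urlencode({"tokenizer": tokenizer, "filters": ",".join(filters)}, safe=",")
    return f"http://{host}{path}?{query}"


filters = ["lowercase", "german_normalize"]
print(analyze_url("localhost:9200", "whitespace", filters))
print(analyze_url("localhost:9200", "whitespace", filters, index="test_index"))
```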


On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
Do not use regex; it will give wrong results.

Elasticsearch comes with full support for German umlaut handling.

If you install the ICU plugin, you can use something like this analysis setting:

{
    "index" : {
        "analysis" : {
            "filter" : {
                "german_normalize_stem" : {
                  "type" : "snowball",
                  "name" : "German2"
                }
            },
            "analyzer" : {
                "stemmed" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalize_stem"
                    ]
                },
                "unstemmed" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalize"
                    ]
                }
            }
        }
    }
}

ICU handles German umlauts, and also case folding such as "ss" and "ß".

Snowball handles umlaut expansions (ae, oe, ue) at the right places in words.

You can choose between stemmed and unstemmed analysis. Snowball tends to overstem words. The "german_normalize" token filter is copied from Snowball but works without stemming.

The effect of the combination is that German words like Jörg, Joerg, and Jorg are all reduced to jorg in the index.

Best,

Jörg
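
To make the "reduced to jorg" effect concrete, here is a rough Python sketch of the umlaut heuristics documented for Lucene's `GermanNormalizationFilter`: ß becomes ss; ä, ö, ü become a, o, u; ae and oe become a and o; ue becomes u unless it follows a vowel or q. It assumes lowercased input, the name `german_normalize_sketch` is made up for illustration, and it only approximates the real filter. It is not the plugin's or Lucene's code.

```python
# States of a small scanner over the token:
#   N = ordinary character seen, V = vowel-like context (blocks "ue" folding),
#   U = just saw a letter after which a following "e" marks an umlaut spelling.
N, V, U = 0, 1, 2

UMLAUTS = {"ä": "a", "ö": "o", "ü": "u"}


def german_normalize_sketch(token: str) -> str:
    """Approximate German umlaut normalization for one lowercased token."""
    out = []
    state = N
    for c in token:
        if c in "ao":
            out.append(c)
            state = U
        elif c == "u":
            out.append(c)
            # "u" only opens an umlaut spelling when it does not follow a vowel or q.
            state = U if state == N else V
        elif c == "e":
            if state != U:
                out.append(c)  # keep an ordinary "e"; drop it after a, o, or an eligible u
            state = V
        elif c in "iqy":
            out.append(c)
            state = V
        elif c in UMLAUTS:
            out.append(UMLAUTS[c])
            state = V
        elif c == "ß":
            out.append("ss")
            state = N
        else:
            out.append(c)
            state = N
    return "".join(out)


for word in ["jörg", "joerg", "jorg", "über", "ueber", "uber", "straße", "quelle"]:
    print(word, "->", german_normalize_sketch(word))
```

Because the same folding is applied at index time and at query time, all three spellings of a word end up as the same term. The heuristic is deliberately crude, so unrelated words with a genuine "ue" or "oe" sequence can be folded as well.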


On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[hidden email]> wrote:
Hi Jürgen,

Currently we don't have big volumes of data to index, so we would like to return more results in the hope that the proper ones will still be shown at the top. In the future, when we have more data, we'll have to sacrifice some use cases in order to provide more precise results for the rest of the users.

I think I will try the regexp token approach to replace umlauts with "e" forms to solve this double expansion problem.

Best,

Krešimir

On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) wrote:
Hi Krešimir,
  the correct term is "über" (over, above) or "hören" (hear) or "ändern" (change). When you cannot write umlauts, the correct alternative spelling in print is "ueber", "hoeren", "aendern". Everybody can write this in ASCII. However, those who are possibly non-speakers of German who still want to search for German terms are usually not aware of this and believe it's like with accents in French, where "á" is lexically treated like "a". Those users are wrong in spelling "uber", "horen", "andern" because "u" and "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE letter :-)

However, in order to provide a convenience to those users as well,  you could decide that - to yield at least some meaningful results - you will also consider the versions without the umlaut dots equivalent. In that case, you want to map any token containing an umlaut (ä, ö, ü) to three alternatives: umlaut, without umlaut marker, alternative spelling with 'e'. This won't let you distinguish between the "Bar" (bar, the place to get a drink) and "Bär" (bear, the one giving you a great, dangerous hug). "Forderung" (demand) and "Förderung" (encouragement, facilitation, promotion, extraction [geol.]) are also quite different, just to give a few examples.

For the proper recognition of those terms, you would normally use a dictionary of German, including some frequent proper names as well. So, if you look for "clown boll", you would not only get "Der Clown im Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines Clowns", because the query would be transformed into "clown AND (boll OR boell OR böll)" as "boll" matches an umlaut candidate in your dictionary. If you dare to normalize your indexed texts, so "Boell" would already have been turned into "Böll", you could even do with a disjunction of only the one correct form and the misspelling. Again, however, you would make use of a dictionary to perform such normalization. Ideally, you would even have a POS tagger in place, so you would only make such replacements where the name Böll is referred to, not the city of Bad Boll.

It's a question of how much effort makes sense for your application. If you just want to index some German text, maybe you just want to turn all umlauts into the plain vowels for the purpose of indexing, but still keep the reference to the original for result display. Maybe that's sufficient. For larger volumes of documents, a more precise approach is recommended to avoid false positives.

Cheers,
--Jürgen


On 29.11.2014 20:35, Krešimir Slugan wrote:
Because, as far as I understand, in German it's semantically the same to write über or ueber (although ueber is used less often). I guess the only exception is personal names.
Syntactically, "uber" is wrong, but users sometimes search for it as well.


--

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: [hidden email], URL: www.devoteam.de


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071



Re: char_filter for German

joergprante@gmail.com
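Use "german_normalization"

"german_normalize" is the same filter I implemented in my plugin https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java when it was not available in ES core.
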
Jörg


Re: char_filter for German

Krešimir Slugan
Thanks!

I assume that "german_normalize" is also part of the Decompounder Analysis Plugin ( https://github.com/jprante/elasticsearch-analysis-decompound ), since that is the only analysis plugin we have installed?

Btw, "german_normalization" doesn't seem to be available for our ES version (1.2). Would you recommend upgrading instead of using "german_normalize"?

Best,

Kresimir

On Wednesday, March 11, 2015 at 5:31:40 PM UTC+1, Jörg Prante wrote:
Use "german_normalization"

"german_normalize" is the same filter I implemented in my plugin https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java when it was not available in ES core.

Jörg


Re: char_filter for German

joergprante@gmail.com
Yes, please upgrade Elasticsearch to use the official German normalizer.

I added it to the decompound plugin for convenience; it may be removed at any later time.

Jörg
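
After an upgrade, the "unstemmed" analyzer from the analysis settings posted earlier in this thread could reference the built-in "german_normalization" filter instead of the plugin's "german_normalize". A minimal sketch, assuming the ICU plugin is still installed for "icu_normalizer" and "icu_folding"; only the last filter name differs from the earlier settings, and the Python wrapper is there just to print valid JSON:

```python
import json

# Analysis settings for an index; "german_normalization" is the token filter
# shipped with Elasticsearch itself, so no extra filter definition is needed.
settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "unstemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalization",
                    ],
                }
            }
        }
    }
}

# Print the JSON body that would be supplied as index settings.
print(json.dumps(settings, indent=4))
```

Existing documents would need to be reindexed for a changed analyzer to take effect on already indexed text.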

On Wed, Mar 11, 2015 at 9:54 PM, Krešimir Slugan <[hidden email]> wrote:
Thanks!

I assume that "german_normalize" is also part of Decompounder Analysis Plugin ( https://github.com/jprante/elasticsearch-analysis-decompound ) since that is the only analysis plugin we have installed?

Btw. "german_normalization" doesn't seems to be available for our ES version (1.2), would you recommend upgrading instead of using  "german_normalize"?

Best,

Kresimir

On Wednesday, March 11, 2015 at 5:31:40 PM UTC+1, Jörg Prante wrote:
Use "german_normalization"


Jörg

On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan <[hidden email]> wrote:

Where is this "german_normalize" filter coming from? It solves my problem completely and magically but it's not documented anywhere (and seems like it's not part of ICU plugin either). 

 
What is also weird is that filter can not be used in global context, e.g. it's not possible to try something like this: 
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'

but it is possible to use it in index context:
curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'


In first case I get "ElasticsearchIllegalArgumentException[failed to find global token filter under [german_normalize]]"


On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
Do not use regex, this will give wrong results.

Elasticsearch comes with full support for german umlaut handling.

If you install ICU plugin, you can use something like this analysis setting

{
    "index" : {
        "analysis" : {
            "filter" : {
                "german_normalize_stem" : {
                  "type" : "snowball",
                  "name" : "German2"
                }
            },
            "analyzer" : {
                "stemmed" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalize_stem"
                    ]
                },
                "unstemmed" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [
                        "lowercase",
                        "icu_normalizer",
                        "icu_folding",
                        "german_normalize"
                    ]
                }
            }
        }
    }
}

ICU handles german umlauts, and also case folding like "ss" and "ß".

Snowball handles umlaut expansions (ae, oe, ue) at the right places in words.

You can choose between stemmed and unstemmed analysis. Snowball tends to overstem words. The "german_normalize" token filter is copied from Snowball but works without stemming.

The effect of the combination is that all german words like Jörg,  Joerg, Jorg are reduced to jorg in the index.

Best,

Jörg


On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[hidden email]> wrote:
Hi Jürgen,

Currently we don't have big volumes of data to index so we would like to yield more results in hope that proper ones would still be shown in the top. In future, when we have more data, we'll have to sacrifice some use cases in order to provide more precise results for the rest of users. 

I think I will try the regexp token approach to replace umlauts with their "e" forms to solve this double expansion problem.
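
As a minimal sketch of what such a replacement does, independent of how it would be wired into an Elasticsearch analyzer, the following Python snippet rewrites umlauts to their "e" forms. The names E_FORMS and to_e_form are made up for this example, and the NFC step is an added assumption to cover umlauts typed as a base letter plus a combining diaeresis.

```python
import re
import unicodedata

# Assumed mapping for illustration; upper-case umlauts are included.
E_FORMS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
UMLAUT_RE = re.compile("[äöüÄÖÜ]")


def to_e_form(text: str) -> str:
    """Replace each umlaut with its two-letter 'e' spelling."""
    # Compose 'u' + combining diaeresis into a single 'ü' first.
    composed = unicodedata.normalize("NFC", text)
    return UMLAUT_RE.sub(lambda m: E_FORMS[m.group(0)], composed)


print(to_e_form("über"))  # ueber
```

This only covers the direction from "über" to "ueber"; a query for "uber" is left unchanged and would still need separate handling.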

Best,

Krešimir

On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) wrote:
Hi Krešimir,
  the correct term is "über" (over, above) or "hören" (hear) or "ändern" (change). When you cannot write umlauts, the correct alternative spelling in print is "ueber", "hoeren", "aendern". Everybody can write this in ASCII. However, non-speakers of German who still want to search for German terms are usually not aware of this and believe it works like accents in French, where "á" is lexically treated like "a". Those users are wrong in spelling "uber", "horen", "andern" because "u" and "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE letter :-)

However, as a convenience to those users, you could decide to treat the versions without the umlaut dots as equivalent too, so that they get at least some meaningful results. In that case, you want to map any token containing an umlaut (ä, ö, ü) to three alternatives: the umlaut form, the form without the umlaut marker, and the alternative spelling with 'e'. This won't let you distinguish between the "Bar" (bar, the place to get a drink) and "Bär" (bear, the one giving you a great, dangerous hug). "Forderung" (demand) and "Förderung" (encouragement, facilitation, promotion, extraction [geol.]) are also quite different, just to give a few examples.
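
The three-way mapping can be sketched in a few lines of Python. This is only an illustration of the idea, not an Elasticsearch filter; umlaut_variants and the two lookup tables are hypothetical names.

```python
from typing import List

# Hypothetical lookup tables for this sketch.
E_FORMS = {"ä": "ae", "ö": "oe", "ü": "ue"}
PLAIN = {"ä": "a", "ö": "o", "ü": "u"}


def umlaut_variants(token: str) -> List[str]:
    """Return the umlaut, plain and 'e' spellings of a token."""
    token = token.lower()
    if not any(ch in E_FORMS for ch in token):
        return [token]  # no umlaut, nothing to expand
    plain = "".join(PLAIN.get(ch, ch) for ch in token)
    with_e = "".join(E_FORMS.get(ch, ch) for ch in token)
    return [token, plain, with_e]


print(umlaut_variants("über"))  # ['über', 'uber', 'ueber']
print(umlaut_variants("Bär"))   # ['bär', 'bar', 'baer']
```

The second call shows the collision: "Bär" now also produces "bar", so a search for the drinking place matches the bear.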

For the proper recognition of those terms, you would normally use a dictionary of German, including some frequent proper names as well. So, if you look for "clown boll", you would not only get "Der Clown im Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines Clowns", because the query would be transformed into "clown AND (boll OR boell OR böll)" as "boll" matches an umlaut candidate in your dictionary. If you dare to normalize your indexed texts, so that "Boell" has already been turned into "Böll", you could even make do with a disjunction of only the one correct form and the misspelling. Again, however, you would use a dictionary to perform such normalization. Ideally, you would even have a POS tagger in place, so that you only make such replacements where the name Böll is meant, not the city of Bad Boll.
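
A minimal sketch of this dictionary-driven query expansion, assuming a tiny hand-made dictionary that maps a plain spelling to its known umlaut word. UMLAUT_DICTIONARY, expand_term and expand_query are hypothetical names, and a real system would load a proper German word list instead.

```python
from typing import Dict

# Hypothetical dictionary: plain spelling -> known umlaut word.
UMLAUT_DICTIONARY: Dict[str, str] = {"boll": "böll"}
E_FORMS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def expand_term(term: str) -> str:
    """Expand a term into a disjunction if it is an umlaut candidate."""
    term = term.lower()
    umlaut_form = UMLAUT_DICTIONARY.get(term)
    if umlaut_form is None:
        return term  # not an umlaut candidate, leave unchanged
    e_form = "".join(E_FORMS.get(ch, ch) for ch in umlaut_form)
    return "(" + " OR ".join([term, e_form, umlaut_form]) + ")"


def expand_query(query: str) -> str:
    """Join all terms with AND, expanding umlaut candidates."""
    return " AND ".join(expand_term(t) for t in query.split())


print(expand_query("clown boll"))  # clown AND (boll OR boell OR böll)
```

Terms that are not in the dictionary pass through untouched, so only known umlaut candidates pay the cost of the extra disjunction.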

It's a question of how much effort makes sense for your application. If you just want to index some German text, maybe you just want to turn all umlauts into the plain vowels for the purpose of indexing, but still keep the reference to the original for result display. Maybe that's sufficient. For larger volumes of documents, a more precise approach is recommended to avoid false positives.

Cheers,
--Jürgen


On 29.11.2014 20:35, Krešimir Slugan wrote:
Because, as far as I understand, in German it's semantically the same to write über or ueber (although ueber is less often used). I guess the only exception is personal names.
Strictly speaking, "uber" is wrong, but users sometimes search for it too.



