Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Prashant Agrawal
Hi All,

We have an ES cluster that is used to index a large amount of data in different languages. Our current settings point to an English analyzer and an English hunspell dictionary, but how can we index multilingual data with a multilingual analyzer and hunspell setup on the same index? (I came across a plugin called Elasticsearch Langdetect Plugin (https://github.com/jprante/elasticsearch-langdetect), available from ES 1.2.1.)

Our current analyzer settings look like this:
index :
  analysis :
      analyzer :
        synonym :
            tokenizer : whitespace
            filter : [synonym]
        default_index :
            type : custom
            tokenizer : whitespace
            filter : [standard, lowercase, hunspell_US]
        default_search :
            type : custom
            tokenizer : whitespace
            filter : [standard, lowercase, synonym, hunspell_US]
      filter :
        synonym :
            type : synonym
            ignore_case : true
            expand : true
            synonyms_path : synonyms.txt
        hunspell_US :
            type : hunspell
            locale : en_US
            dedup : false
            ignore_case : true


So here:
1) Can we configure multilingual analyzers and hunspell for the same index, and then index data by configuring the langdetect plugin for specific fields? Will the data be indexed and analyzed by the corresponding language analyzer, and will it also be searchable against the multiple hunspell dictionaries and synonyms configured?

Please confirm whether the settings below can be used to achieve this:

      analyzer :
        synonym :
            tokenizer : whitespace
            filter : [synonym]
        default_index :
            type : custom
            tokenizer : whitespace
            filter : [standard, lowercase, hunspell_US, hunspell_IN, hindi, english]
        default_search :
            type : custom
            tokenizer : whitespace
            filter : [standard, lowercase, synonym, hunspell_US, hunspell_IN, hindi, english]
      filter :
        hindi :
            tokenizer : standard
            filter : [lowercase]
        english :
            tokenizer : standard
            filter : [lowercase]
        synonym :
            type : synonym
            ignore_case : true
            expand : true
            synonyms_path : synonyms.txt
        hunspell_US :
            type : hunspell
            locale : en_US
            dedup : false
            ignore_case : true
        hunspell_IN :
            type : hunspell
            locale : hi_IN
            dedup : false
            ignore_case : true

                       
Say that after this I configure the langdetect plugin and index some data in different languages, English and Hindi. Since multiple language analyzers and multilingual hunspell are configured, will I be able to index and search each language with its respective analyzer and get results based on the analyzed tokens for that language?

Also, will synonyms work across the different languages as well?

~Prashant

Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Prashant Agrawal
Hi Jorg,

Can you help me out with this, as I see you are the author of the langdetect plugin?

~Prashant

Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

joergprante@gmail.com
In reply to this post by Prashant Agrawal
With the langdetect plugin, there is just a field "lang" mapped under the string field that is used for detection, and the detected language codes are written into this field. This is useful for e.g. aggregations or for filtering documents by language.
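
For illustration, filtering on that sub-field could look roughly like this (a sketch only; "content" stands in for whatever string field langdetect is mapped on):

{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : {} },
      "filter" : {
        "term" : { "content.lang" : "en" }
      }
    }
  }
}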

At the moment it is not possible to use something like a dynamic "copy_to" to duplicate the field after detection into a field with a language-specific analyzer, such as a synonym analyzer.

A feature request at the issue tracker on GitHub would be much appreciated, so I can have a look into this.

Jörg



Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Nitin Maheshwari
In reply to this post by Prashant Agrawal
You can use the langdetect plugin to identify the language of the document, and use that field's path to set _analyzer. _analyzer can be set dynamically that way, so for the languages that are detected, analyzers with those names must exist in the system.

"my_index" : {
"_analyzer" : {
"path" : "lang_detect_field.lang"
},
"properties" : {
"lang_detect_field" : {
"type" : "langdetect",
"fields" : {
"lang_detect" : {
"type" : "string"
},
"lang" : {
"type" : "string"
}
}
}
}
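
For illustration, a hypothetical index request against such a mapping (the index name "test" is a placeholder, and "my_index" above is taken to be the mapping type name):

curl -XPUT 'localhost:9200/test/my_index/1' -d '{
  "lang_detect_field" : "This is a short English sentence."
}'

The plugin writes the detected code into lang_detect_field.lang, and the _analyzer path then resolves to an analyzer of that name, so analyzers named after the expected language codes (e.g. "en", "hi") have to be defined in the index settings.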




Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

joergprante@gmail.com
Wow, this works? Surprise....

Jörg


Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Nitin Maheshwari
Yes, it works... :)

But I am dealing with a different problem: I have a data set of around 200,000 records, and it detected the wrong language for about 8,000 of them, maybe because the texts were small. In most of the wrong detections it returned "af".

Indexing such a document fails if no analyzer named after the detected language is defined in the system. I am trying to find out how I can fall back to a default analyzer when the detected analyzer does not exist.
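
To illustrate the constraint (this is only a sketch, not plugin functionality): an analyzer whose name matches the detected code has to exist in the index settings, so for a stray "af" detection something like the following would at least keep the document indexable:

index :
  analysis :
    analyzer :
      af :
        type : custom
        tokenizer : standard
        filter : [lowercase]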


Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

joergprante@gmail.com
Sure. In the next release I will add new parameters, so there will be better control over:

- the languages that can be detected (default: all)

- if there is more than one language detected, how many language codes are indexed

- threshold levels for successful detection

- and how many words must be in a field before detection is executed (or the field length in characters). The number of words should be at least 3; otherwise detection is close to random.

Jörg


Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Prashant Agrawal
In reply to this post by Nitin Maheshwari
Hi Jörg/Nitin,

If I am not wrong, you are suggesting that I set the analyzer dynamically, which I have tried to set up manually like below:
      analyzer :
        synonym :
            tokenizer : whitespace
            filter : [synonym]
        default_index :
            _analyzer :
                path : lang_detect_field.lang
            tokenizer : whitespace
            filter : [standard, lowercase, hunspell_US, hunspell_IN]
        default_search :
            _analyzer :
                path : lang_detect_field.lang
            tokenizer : whitespace
            filter : [standard, lowercase, synonym, hunspell_US, hunspell_IN]
      filter :
        synonym :
            type : synonym
            ignore_case : true
            expand : true
            synonyms_path : synonyms.txt
        hunspell_US :
            type : hunspell
            locale : en_US
            dedup : false
            ignore_case : true
        hunspell_IN :
            type : hunspell
            locale : hi_IN
            dedup : false
            ignore_case : true
                       
Considering I have lang_detect_field in my mapping.

Correct me if I am wrong.

1) Also, what if my content is of attachment type? Can I still use the same approach, given that the content for indexing will be sent in base64-encoded format? If yes, how can we configure that, since the field type will be attachment?

2) If we cannot achieve this using the langdetect plugin, can we still have multiple language analyzers in our config as mentioned in the first post, and will ES be able to recognize them and perform the indexing?

3) @Jörg, since you are the author of the hunspell plugin as well, can you let me know whether I can have hunspell dictionaries for multiple languages configured in my index settings?

~Prashant

Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

joergprante@gmail.com
You cannot use "_analyzer", which is a root mapping property, inside an analyzer definition.

My Hunspell plugin is kind of stalled, since there is hunspell support in the core code:


To be honest, it was quite a while ago that I was busy with hunspell, so I cannot answer your question right away. I remember the results of hunspell dictionaries used for stemming were not satisfying; this was also due to a poor hunspell dictionary reader. I hope the ES core code works better.

Because hunspell stemming is a token filter, you'd have to create a bunch of custom analyzers with a hunspell token filter per language, and address them via the "_analyzer" path method as shown above.
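
A rough sketch of that approach (the analyzer names "en" and "hi" are assumptions and must match the codes the langdetect plugin writes into the lang field; the hi_IN dictionary is taken from the settings earlier in this thread):

index :
  analysis :
    filter :
      hunspell_en :
        type : hunspell
        locale : en_US
      hunspell_hi :
        type : hunspell
        locale : hi_IN
    analyzer :
      en :
        type : custom
        tokenizer : standard
        filter : [lowercase, hunspell_en]
      hi :
        type : custom
        tokenizer : standard
        filter : [lowercase, hunspell_hi]

with the root-level "_analyzer" in the mapping pointing at the detected code, as in the example shown earlier:

"_analyzer" : {
  "path" : "lang_detect_field.lang"
}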

Jörg


Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Prashant Agrawal
Hi Jorg,

What about these questions:

1) Also, what if my content is of attachment type? Can I still use the same approach, given that the content for indexing will be sent in base64-encoded format? If yes, how can we configure that, since the field type will be attachment?

2) If we cannot achieve this using the langdetect plugin, can we still have multiple language analyzers in our config as mentioned in the first post, and will ES be able to recognize them and perform the indexing?

3) "I hope the ES core code works better."
Here, what do you mean by the ES core code? Are there any specific settings that can be used for grammar-based search?

Also, I am not getting a clear picture (it got mixed up somewhere with _analyzer) of how we can create multiple analyzers for different languages on the same index. It would be great if you could give me a small demonstration of how the analyzer settings in my first post (where I used default_index and default_search) could be replaced with two analyzers selected dynamically.

~Prashant

Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

joergprante@gmail.com
Langdetect works with binary content, e.g. from the attachment mapper.

The other questions cannot be answered quickly. If I find time, I can post something on gist.github.com.

Jörg


Re: Use case of multiple Language Analyzer, Hunspell along with Elasticsearch Langdetect Plugin

Prashant Agrawal
OK, no problem.

Can you let me know if you post anything for these queries on gist.github.com?

Also, I hope that by using the langdetect plugin we can analyze content in different languages as well (using _analyzer), so a feature request for "a dynamic copy_to to duplicate the field after detection to a field with a language-specific analyzer, like a synonym analyzer" is not required now, right?