Specifying analyzer on a per field basis at index time

Specifying analyzer on a per field basis at index time

barnybug
I understand you can specify the analyzer per document at index time
using the _analyzer field in mapping, but is it possible to specify it
in the same way but per field at index time?

Or if not currently possible, how easy to add (happy to have a crack
at it myself)?

thanks

Barnaby

Re: Specifying analyzer on a per field basis at index time

kimchy
Administrator
No, you can't specify it per field, though why do you want it? Usually, having a different analyzer for each document doesn't make a lot of sense; it makes more sense to have different fields.

On Tuesday, March 6, 2012 at 6:01 PM, barnybug wrote:

I understand you can specify the analyzer per document at index time
using the _analyzer field in mapping, but is it possible to specify it
in the same way but per field at index time?

Or if not currently possible, how easy to add (happy to have a crack
at it myself)?

thanks

Barnaby


Re: Specifying analyzer on a per field basis at index time

barnybug
Hi,

Thanks for the response.

Currently we're indexing a set of documents in different languages and using _analyzer mapping to determine the per doc stemming analyzer.

What we'd like to do is index some fields of the documents both stemmed and unstemmed (e.g. the english analyzer to produce stemmed English and the 'standard' analyzer to produce unstemmed text). A multi_field seems applicable, but then the two analyzers are fixed. We'd effectively need to specify two _analyzer fields.
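
For illustration, the stemmed/unstemmed multi_field setup described above might look like the following. This is only a sketch: the field name `body` and the sub-field name `stemmed` are hypothetical, and `multi_field` with type `string` is the mapping syntax of Elasticsearch releases from that era (later releases use a `fields` section on an ordinary field instead). It shows why the approach falls short here: both analyzers are written into the mapping once and cannot vary per document.

```python
import json


def multi_field_mapping(field, stemmed_analyzer="english"):
    """Build a multi_field mapping with one unstemmed and one stemmed sub-field."""
    return {
        field: {
            "type": "multi_field",
            "fields": {
                # Default sub-field: same name as the parent, unstemmed.
                field: {"type": "string", "analyzer": "standard"},
                # Second sub-field, addressed as "<field>.stemmed".
                # The analyzer is fixed here for every document.
                "stemmed": {"type": "string", "analyzer": stemmed_analyzer},
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(multi_field_mapping("body"), indent=2))
```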

Essentially the customer wants to be able to do both stemmed (language specific) searches and unstemmed (general) searches. This comes down to a requirement to be able to match names, proper nouns, etc in cases where stemming may interfere but there's no definitive list of these terms that should not be stemmed.

We considered an index per language, but we're dealing with quite a high number of languages, so that would likely mean too many indexes.
Using a field per language also presents issues: the general unstemmed searches would require querying across many fields.

Alternatively, we were considering whether it'd be easy to develop a tokenizer that wraps existing stemming tokenizers but also produces the original term in addition to the stemmed term.
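
The wrapping idea can be sketched outside Elasticsearch as a plain generator that emits each original term and, at the same position, its stem when the two differ. Everything here is an assumption made for illustration: `toy_stem` is a hypothetical stand-in that only strips a few English suffixes, not a real stemmer, and whitespace splitting stands in for a real tokenizer.

```python
def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def stem_and_keep_original(text, stem=toy_stem):
    """Yield (position, term) pairs: the original term, then its stem if different."""
    for position, raw in enumerate(text.split()):
        term = raw.lower()
        yield position, term
        stemmed = stem(term)
        if stemmed != term:
            # Same position as the original, so phrase matching is unaffected.
            yield position, stemmed


tokens = list(stem_and_keep_original("Running Waters"))
# tokens == [(0, "running"), (0, "runn"), (1, "waters"), (1, "water")]
```

With this shape a name such as "Barnes" is indexed both as `barnes` and as its (unwanted) stem `barn`, so an exact search for the name still matches.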

Sorry if that makes less than perfect sense!

thanks,

Barnaby

On Tuesday, 6 March 2012 20:29:57 UTC, kimchy wrote:
No, you can't specify it per field, though why do you want it? Usually, having a different analyzer for each document doesn't make a lot of sense; it makes more sense to have different fields.


Re: Specifying analyzer on a per field basis at index time

kimchy
Administrator
It makes sense. The problem with using different analyzers on the same field is that all those tokens, from the different languages, end up under the same field, so it's "kinda dirty". How about using a single field called x using the standard analyzer, and x_[langId] for each language? You can use dynamic mapping to automatically map analysis parameters for *_en or *_de (and so on; see more under dynamic templates here: http://www.elasticsearch.org/guide/reference/mapping/root-object-type.html).
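
A rough sketch of what such a dynamic-template mapping could look like, built as a plain Python dict so it can be serialized to JSON. The type name `doc`, the template names, and the language-to-analyzer table are assumptions for illustration; `string` is the field type name from Elasticsearch releases of that period, and the analyzer names should be checked against what your version actually ships. The unsuffixed field `x` needs no entry if the default (standard) analyzer is acceptable.

```python
import json

# Assumed language-id -> analyzer-name pairs; extend as needed.
LANGUAGE_ANALYZERS = {"en": "english", "de": "german"}


def build_mapping(type_name="doc", analyzers=LANGUAGE_ANALYZERS):
    """Build a mapping whose dynamic templates pick an analyzer by field suffix."""
    templates = []
    for lang_id, analyzer in sorted(analyzers.items()):
        templates.append({
            # Template name is arbitrary; "match" is a pattern on the field name.
            "lang_" + lang_id: {
                "match": "*_" + lang_id,
                "mapping": {"type": "string", "analyzer": analyzer},
            }
        })
    return {type_name: {"dynamic_templates": templates}}


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
```

Any new field whose name ends in `_en` or `_de` would then pick up the matching analyzer the first time it is indexed.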

On Tuesday, March 6, 2012 at 11:17 PM, barnybug wrote:

Hi,

Thanks for the response.

Currently we're indexing a set of documents in different languages and using _analyzer mapping to determine the per doc stemming analyzer.

What we'd like to do is index some fields of the documents both stemmed and unstemmed (e.g. the english analyzer to produce stemmed English and the 'standard' analyzer to produce unstemmed text). A multi_field seems applicable, but then the two analyzers are fixed. We'd effectively need to specify two _analyzer fields.

Essentially the customer wants to be able to do both stemmed (language specific) searches and unstemmed (general) searches. This comes down to a requirement to be able to match names, proper nouns, etc in cases where stemming may interfere but there's no definitive list of these terms that should not be stemmed.

We considered an index per language, but we're dealing with quite a high number of languages, so that would likely mean too many indexes.
Using a field per language also presents issues: the general unstemmed searches would require querying across many fields.

Alternatively, we were considering whether it'd be easy to develop a tokenizer that wraps existing stemming tokenizers but also produces the original term in addition to the stemmed term.

Sorry if that makes less than perfect sense!

thanks,

Barnaby


Re: Specifying analyzer on a per field basis at index time

barnybug
Good plan, thanks for the suggestion.

Barnaby

On Wednesday, 7 March 2012 11:28:17 UTC, kimchy wrote:
It makes sense. The problem with using different analyzers on the same field is that all those tokens, from the different languages, end up under the same field, so it's "kinda dirty". How about using a single field called x using the standard analyzer, and x_[langId] for each language? You can use dynamic mapping to automatically map analysis parameters for *_en or *_de (and so on; see more under dynamic templates here: http://www.elasticsearch.org/guide/reference/mapping/root-object-type.html).


Re: Specifying analyzer on a per field basis at index time

Sapana Patel
In reply to this post by kimchy
Hi,

I am facing the same problem but am not able to decide which option to use.
I have one document type with the fields id, name, description, datetime and userid.
Of these, only two fields, name and description, can be in any language (English, German, etc.).

Can you please explain the following sentence with an example, or suggest which approach I should follow for better performance?

How about using a single field called x using the standard analyzer, and x_[langId] for each language? You can use dynamic mapping to automatically map analysis parameters for *_en, or *_de  etc.

Please give an example of automatically mapping the analysis parameters.

I have to use the Java API for this. Is that possible with the Java API?
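
As a sketch of how the suggested field layout could apply to a document like this one: each language-dependent field is kept under its plain name (unstemmed) and copied to a suffixed name such as `name_de`, which a suffix-based dynamic template could then analyze per language. The helper names `to_indexable` and `search_fields` are hypothetical, and the example is written in Python only to show the document shape; the resulting document is plain JSON, so the same structure applies whichever client sends it.

```python
def to_indexable(doc, lang_id, text_fields=("name", "description")):
    """Copy each language-dependent field to '<field>_<lang_id>', keeping the original."""
    out = dict(doc)
    for field in text_fields:
        if field in doc:
            out[field + "_" + lang_id] = doc[field]
    return out


def search_fields(lang_id=None, text_fields=("name", "description")):
    """Fields to query: plain names for unstemmed search, suffixed for one language."""
    if lang_id is None:
        return list(text_fields)
    return [field + "_" + lang_id for field in text_fields]


doc = {"id": 1, "name": "Häuser", "description": "Alte Häuser", "userid": 7}
indexable = to_indexable(doc, "de")
# indexable now also has "name_de" and "description_de" with the same text.
```

An unstemmed search then touches only `name` and `description`, regardless of how many languages are indexed.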

--
Thanks
Sapana

On Wednesday, March 7, 2012 4:58:17 PM UTC+5:30, kimchy wrote:
It makes sense. The problem with using different analyzers on the same field is that all those tokens, from the different languages, end up under the same field, so it's "kinda dirty". How about using a single field called x using the standard analyzer, and x_[langId] for each language? You can use dynamic mapping to automatically map analysis parameters for *_en or *_de (and so on; see more under dynamic templates here: http://www.elasticsearch.org/guide/reference/mapping/root-object-type.html).
