Searching for "foo" should also find occurrence of "foo.bar"

Marian Steinbach
We have Elasticsearch 1.5 set up with a very simple mapping to perform full-text search in our docs (https://docs.giantswarm.io/). When searching for "swarmvars" we get no hits, although "swarmvars.json" appears in the documents.

The field "text" is used as a catch-all field for all searchable content (title, document body, keywords). Here is the mapping:

"properties": {
  ...,
  "text": {
    "type": "string",
    "store": true,
    "index": "analyzed",
    "term_vector": "with_positions_offsets",
    "analyzer": "english"
  }
}

When the "english" analyzer is applied to the text "Text containing swarmvars.json and more", the resulting tokens are:

text
contain
swarmvars.json
more

Having the token "swarmvars.json" is fine. What I need are two additional tokens "swarmvars" and "json". How can I achieve that?

I looked into creating a custom tokenizer, but I could not get it to work (I got errors when applying the settings), and I could not find an example, no matter how I searched.

Thanks!


--
Please update your bookmarks! We have moved to https://discuss.elastic.co/
---
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/85a03096-ae33-4517-8eab-6f2be4da73ed%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Searching for "foo" should also find occurrence of "foo.bar"

dadoonet
I would probably go with a Pattern Tokenizer and define whatever regex you need:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html

The standard tokenizer is geared toward English text, which means that a dot needs to be followed by a space in order to be treated as a break between two tokens.

Make sense?

-- 
David Pilato - Developer | Evangelist 
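
A rough sketch of the Pattern Tokenizer idea, written in Python rather than run against Elasticsearch: the tokenizer splits the text wherever the regex matches, which is emulated here with `re.split`. The pattern `[\s.]+` and the names `dot_tokenizer`, `dot_analyzer`, and `pattern_tokenize` are illustrative assumptions and do not come from this thread. The settings dict only mirrors the general shape of a custom analyzer definition and has not been verified against a 1.5 cluster.

```python
import re

# Hypothetical analysis settings: a pattern tokenizer that splits on
# whitespace and dots, followed by the built-in lowercase filter.
ANALYSIS_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "dot_tokenizer": {"type": "pattern", "pattern": r"[\s.]+"}
        },
        "analyzer": {
            "dot_analyzer": {
                "type": "custom",
                "tokenizer": "dot_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}


def pattern_tokenize(text, pattern=r"[\s.]+"):
    """Emulate a split-style pattern tokenizer followed by lowercasing."""
    return [tok.lower() for tok in re.split(pattern, text) if tok]


print(pattern_tokenize("Text containing swarmvars.json and more"))
```

Note that this emulation does no stemming or stop-word removal, so its output differs from what the "english" analyzer would produce for the same text.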






Re: Searching for "foo" should also find occurrence of "foo.bar"

Marian Steinbach
Thanks for the reply! However, it doesn't quite make sense to me yet.

If I use the dot as an additional separator, I will end up with the tokens "swarmvars" and "json", but not "swarmvars.json". Right?




Re: Searching for "foo" should also find occurrence of "foo.bar"

dadoonet
Yes. Because « Hello. How are you? » is a sentence that can be broken into « hello », « how », « are », « you ».
But in « I paid it 2.50 euros », I would most likely keep « 2.50 » as a whole token.

-- 
David Pilato - Developer | Evangelist 
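
The distinction above can be approximated in a few lines of Python. This is a simplified stand-in, not the standard tokenizer's actual algorithm: it splits on whitespace and strips punctuation only from the edges of each token, so a dot inside a token survives. The function name `sentence_aware_tokens` is made up for this sketch.

```python
def sentence_aware_tokens(text):
    """Split on whitespace, then strip punctuation from token edges only.

    A dot at the end of a word (followed by a space) disappears, while a
    dot inside a token such as "2.50" is kept.
    """
    tokens = []
    for raw in text.split():
        tok = raw.strip(".,;:!?\"'()").lower()
        if tok:
            tokens.append(tok)
    return tokens


print(sentence_aware_tokens("Hello. How are you?"))
print(sentence_aware_tokens("I paid it 2.50 euros"))
```

Under the same rule, "swarmvars.json" stays a single token, which matches the behaviour described in the original question.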







Re: Searching for "foo" should also find occurrence of "foo.bar"

Marian Steinbach


On Friday, 29 May 2015 at 11:02:25 UTC+2, David Pilato wrote:
Yes. Because « Hello. How are you? » is a sentence that can be broken into « hello », « how », « are », « you ».
But in « I paid it 2.50 euros », I would most likely keep « 2.50 » as a whole token.


So far, so good. My question now is: from the text "foo.bar", how can I generate ALL of the following tokens?

foo
bar
foo.bar
 


Re: Searching for "foo" should also find occurrence of "foo.bar"

dadoonet
I would use two analyzers and a multi-field mapping: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#_multi_fields_3

-- 
David Pilato - Developer | Evangelist 
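
A minimal sketch of the two-analyzer, multi-field idea, again emulated in Python rather than run against Elasticsearch. The sub-field name `parts` and the analyzer name `dot_analyzer` are hypothetical; `dot_analyzer` is assumed to be a custom analyzer, defined in the index settings, whose tokenizer also splits on dots. The two helper functions are crude stand-ins for the real analyzers (no stemming, no stop words), and the mapping dict only illustrates the general shape of a 1.x-style multi-field definition.

```python
import re

# Hypothetical mapping: the main field keeps the "english" analyzer, while
# the sub-field "text.parts" is analyzed by a dot-splitting analyzer.
MAPPING = {
    "properties": {
        "text": {
            "type": "string",
            "analyzer": "english",
            "fields": {
                "parts": {"type": "string", "analyzer": "dot_analyzer"}
            },
        }
    }
}


def whole_tokens(text):
    """Stand-in for the main field: "foo.bar" stays one token."""
    return {t.strip(".,;:!?").lower() for t in text.split()} - {""}


def part_tokens(text):
    """Stand-in for the sub-field: dots also act as separators."""
    return {t.lower() for t in re.split(r"[\s.]+", text) if t}


def matches(query, document):
    """True if the query term is found in either field's token set."""
    q = query.lower()
    return q in whole_tokens(document) or q in part_tokens(document)


doc = "Text containing swarmvars.json and more"
for term in ("swarmvars", "json", "swarmvars.json", "missing"):
    print(term, matches(term, doc))
```

Between them, the two fields hold "foo", "bar", and "foo.bar", so a search that targets both the main field and the sub-field can match any of the three forms.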




