help needed with the query


mfeingold
Can you guys help me understand how I should approach my problem? Here
is what I need to do:
As a user types information about a member, I need to provide a list of
suggestions based on the information provided so far. The information
may or may not include the following: the member's first/last name,
member ID and date of birth.

It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/last name. I also would like to use a list of synonyms for the
first name ('Bob' vs 'Robert'). It also makes sense to use edit
distance against the member ID, to account for typing errors. Date of
birth I would like to use as a filter: if it is present, I only show
members matching it.
Also, to further narrow the list, I would only include members who live
within a certain distance of the service location.

All of this is just thinking aloud (in writing?). I am not sure how
this should (or can) be translated into an ES configuration/query.
Any help?

Re: help needed with the query (partial matching)

Clinton Gormley
Hi Michael

> As a user types information about a member I need to provide a list of
> suggestions based on the ...: member first/last name, his member id
> and date of birth.

>
> It is my understanding  that I need to index my membership info (~100M
> records) so that I can use edit distance and metaphone against first
> name/ last name. I also would like to use a list of synonyms for the
> first name ('Bob' vs 'Robert').

Yes - preparing your data correctly is essential.

> It also makes sense to use edit
> distance against member id - to account for typing errors.

Do you really think this is the case? If somebody is typing a member ID,
then I think they should see JUST the associated user.  Otherwise, if
all I have is the member ID, I type that, and it shows me 20 different
users, how do I know which is the one I want? I'll ignore this
requirement.

> Date of birth I would like to use as a filter - if it is there I only
> show members matching it.

> Also to further narrow the list I would only include members who live
> within certain distance of the service location.

I apologise in advance - this email is long, but is well worth reading
(and I should probably turn it into a tutorial, as this question is
asked often):

OK, so there are two phases here:
1) preparing your data, and
2) searching

PREPARING YOUR DATA:
--------------------

First, let's decide how each field needs to be indexed, then we can look
at what analyzers we need to provide.

 - first name / last name:
   - these are string fields

   - we want to use synonyms (eg Robert vs Bob)
     http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html

   - we want to include metaphones
     http://www.elasticsearch.org/guide/reference/index-modules/analysis/phonetic-tokenfilter.html

   - we want to find 'clinton' if the user types 'clin' (auto-complete), so we'll use
     edge ngrams: http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenfilter.html

   - and let's also use ascii-folding to allow 'éëè' to match 'e'
     http://www.elasticsearch.org/guide/reference/index-modules/analysis/asciifolding-tokenfilter.html

   - we want to do 3 types of matches:
      - most relevant:  full word matches
      - less relevant:  partial word matches (eg with ngrams, synonyms)
      - least relevant: metaphone matches

     so we'll index the names with three versions, as a multi-field
     http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html

 - member id
     - you didn't specify whether this is numeric or alphanumeric, so
       I'll just assume alphanumeric, possibly with punctuation, eg
       "ABC-1234"

     - let's say that we want to tokenize this as "abc","1234", so we'll
       use the "simple" analyzer
       http://www.elasticsearch.org/guide/reference/index-modules/analysis/simple-analyzer.html

 - birthday
     - this is just a date field, no analysis needed

 - location
     - this is a geo_point field, no analysis needed

So we have a list of custom analyzers we need to define:

  - full_name:
       - standard token filter
       - lowercase
       - ascii folding

  - partial_name:
       - standard token filter
       - lowercase
       - ascii folding
       - synonyms
       - edge ngrams

  - name_metaphone:
       - standard token filter
       - phonetic/metaphone filter

Here is the command to create the index with the above analyzers and
mapping: https://gist.github.com/1088986

It is quite long, but just goes through the process listed above. If you
look at each block, it's actually quite simple.
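For reference, here's a minimal sketch of what the analysis settings in
that gist look like. The analyzer names come from the list above, but
the filter names (name_synonyms, name_ngrams, name_metaphone), the gram
sizes and the synonym list are illustrative - the gist is authoritative:

   curl -XPUT 'http://127.0.0.1:9200/test' -d '{
     "settings": {
       "analysis": {
         "filter": {
           "name_synonyms": {
             "type":     "synonym",
             "synonyms": ["bob, robert"]
           },
           "name_ngrams": {
             "type":     "edgeNGram",
             "side":     "front",
             "min_gram": 1,
             "max_gram": 20
           },
           "name_metaphone": {
             "type":    "phonetic",
             "encoder": "metaphone",
             "replace": true
           }
         },
         "analyzer": {
           "full_name": {
             "type":      "custom",
             "tokenizer": "standard",
             "filter":    ["standard", "lowercase", "asciifolding"]
           },
           "partial_name": {
             "type":      "custom",
             "tokenizer": "standard",
             "filter":    ["standard", "lowercase", "asciifolding",
                           "name_synonyms", "name_ngrams"]
           },
           "name_metaphone": {
             "type":      "custom",
             "tokenizer": "standard",
             "filter":    ["standard", "name_metaphone"]
           }
         }
       }
     }
   }'

(The phonetic filter requires the phonetic analysis plugin to be
installed.)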

NOTES:
1) For first_name/last_name, I am using multi-fields.  The "main"
   sub-field has the same name as the top level, so that if I refer
   to "first_name" it automatically references "first_name.first_name"

   So in effect, I have "first_name" and "first_name.partial" and
   "first_name.metaphone"

2) In the partial name fields, I am using index_analyzer and
   search_analyzer.

   Normally, you want your data and search terms to use the same
   analyzer - this ensures that you are searching for the same
   terms that are actually stored in ES. For example, in the
   first_name.metaphone field, I just specify an 'analyzer'
   (which sets both the search_analyzer and index_analyzer to
   the same value).

   However, for the partial field, we want them to be different.  If we
   store the name "Clinton", we want to be able to use auto-complete
   for search terms like 'clin' (ie partial matches). So at index time,
   we tokenize clinton as c,cl,cli,clin,clint,clinto,clinton

   However, when we search, we don't want 'clin' to match 'cat','cliff'
   etc.  So we DON'T want to use the ngram tokenizer on search terms.
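   Putting those two notes together, a hedged sketch of the first_name
   mapping (the choice of full_name as the search analyzer for the
   partial field is my assumption - the gist has the canonical version):

      "first_name": {
        "type": "multi_field",
        "fields": {
          "first_name": {
            "type":     "string",
            "analyzer": "full_name"
          },
          "partial": {
            "type":            "string",
            "index_analyzer":  "partial_name",
            "search_analyzer": "full_name"
          },
          "metaphone": {
            "type":     "string",
            "analyzer": "name_metaphone"
          }
        }
      }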

So, run the commands in the gist above, and then you can experiment with
searches.

You can see what tokens each analyzer produces with the analyze API.
Try these queries:

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=full_name' 
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=partial_name' 
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=name_metaphone' 

and to check that the ascii folding is working, try 'sánchez' (but URL encoded):

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=s%C3%A1nchez&analyzer=partial_name' 


SEARCHING YOUR DATA
-------------------

Let's get rid of the easy stuff first:

birthday:

        If your user enters a birthday, then you want to filter the
        results to only include members with a matching birthday:
       
           { term: { birthday: '1970-10-24' }}
       
location:

        Use a geo_distance filter to find results within 100km of
        London:
       
           { geo_distance: {
                   distance: "100km",
                   location: { lat: 51.50853, lon: -0.12574 }
           }}
       
member_id:

     This can use a simple text query:

     { text: { member_id: "abc-1234" }}

OK - now the more interesting stuff: first name and last name.

The logic we want to use here is:

   Show me any name whose first or last name field matches
   completely or partially, but consider full word matches
   to be more relevant than partial matches or metaphone
   matches

We're going to combine these queries using the 'bool' query.  The
difference between the 'bool' query and the 'dismax' query is that the
'bool' query combines the _score/relevance of each matching clause,
while the 'dismax' query chooses the highest _score from the matching
clauses.

{ bool:
   { should: [
       { text: { "first_name": "rob" }},         # full name
       { text: { "first_name.partial": "rob" }}  # partial match
       { text: { "first_name.metaphone": "rob"}} # metaphone
     ]
}}

This will find all docs that match any of the above clauses.

The _score of each matching clause is combined, so a doc which matches
all 3 clauses will rank higher than a doc that matches just one clause,
so we already have some ranking here.  

But let's say that we wanted a full word match to be significantly more
relevant than the other two.  We can change that clause to:

 { text: { "first_name": {
    query: "rob",
    boost: 2      # the default boost is 1
 }}}

Of course, we need to include the same 3 clauses for "last_name" as
well.
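So, as a sketch, the name part of the query for the search string "rob"
would look something like this (the boost values are illustrative):

 { bool:
    { should: [
        { text: { "first_name":           { query: "rob", boost: 2 }}},
        { text: { "first_name.partial":   "rob" }},
        { text: { "first_name.metaphone": "rob" }},
        { text: { "last_name":            { query: "rob", boost: 2 }}},
        { text: { "last_name.partial":    "rob" }},
        { text: { "last_name.metaphone":  "rob" }}
      ]
 }}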

Now, to turn these into a query that we can pass to the Search API:

All search queries must be wrapped in a top-level { query: {....}}
element, which will contain one of 3 possibilities:

1) just the bool query
   { query: { bool: {...} }}

   https://gist.github.com/1089180

2) just one or more filters
   { query: { constant_score: { filter: {....} }}}
     
   https://gist.github.com/1089206

3) the bool query combined with one or more filters
   { query: { filtered: { query: {bool: ...}, filter: {.....} }}}

   https://gist.github.com/1089201
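Assembled, option 3 looks roughly like this ('...' stands for the six
name clauses shown earlier - the gists have the tested versions):

   { query:
      { filtered: {
          query:  { bool: { should: [ ... ] }},
          filter: {
              and: [
                  { term: { birthday: '1970-10-24' }},
                  { geo_distance: {
                          distance: "100km",
                          location: { lat: 51.50853, lon: -0.12574 }
                  }}
              ]
          }
      }}
   }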


This was long, but I hope it was worth it.

If anything isn't clear, please ask, and I can improve this and turn it
into a tutorial.

clint



Re: help needed with the query (partial matching)

mfeingold
Hi Clinton:

Thanks for the quick and detailed response.

To clarify a few points:
1. Synonyms - I only need synonyms on the first name, I think; I do
not imagine they would be of much use for last names. I am not sure
whether dropping synonyms for the last name would have any impact on
performance or disk space. Based on your templates I hope I understand
how to do this.
2. Names - one of the problems I foresee stems from the fact that I
want a single input string with all the search parameters (except
geolocation). The problem is telling the first name apart from the
last name. An additional complication is that a name (both first and
last) can consist of several words. My hope was that I could build the
indexes/queries in such a way that I can throw both names at the query
as a single string, leaving it to ES to figure out which one is which.
3. Edge ngrams - I would like to limit the wildcard search to, let us
say, the first 6 chars, assuming that anything longer implies an exact
match. My clumsy experiments with ES made me think that if ngrams are
in play, they have to go all the way through the max length of the
field, otherwise the search misses exact matches. I hope I was doing
something wrong.
4. Member ID - it is alphanumeric. I still think that some degree of
fuzziness can help here. The idea is to let the user type anything he
knows and provide autosuggest as he types. So if he knows the ID -
great, it should be an immediate hit; but if he mistyped it and also
provided a last name, it can still let me make a pretty well educated
guess. So why not do it? I think, though, that the edit distance
allowed here should be minimal.
5. Geo - I am curious about the performance of this. Does it really
compute the square root of a sum of squares? Does that mean it
actually loops through all documents to find the matching ones? The
reason I am asking is that I do not insist on a circle - I can get
away with a square: instead of a real geo distance I could use a range
on both longitude and latitude. Would that be faster?



Re: help needed with the query (partial matching)

Clinton Gormley

Hi Michael

> 1. Synonyms - I only need synonyms on the first name, I think; I do
> not imagine they would be of much use for last names. I am not sure
> whether dropping synonyms for the last name would have any impact on
> performance or disk space. Based on your templates I hope I understand
> how to do this.

That's fine: just define one custom analyzer with synonyms and use it
for first names, and another without synonyms for last names.
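As a sketch (the analyzer names here are made up, and the filters are
the ones from the settings example earlier in the thread):

   "analyzer": {
     "partial_first_name": {
       "type":      "custom",
       "tokenizer": "standard",
       "filter":    ["standard", "lowercase", "asciifolding",
                     "name_synonyms", "name_ngrams"]
     },
     "partial_last_name": {
       "type":      "custom",
       "tokenizer": "standard",
       "filter":    ["standard", "lowercase", "asciifolding",
                     "name_ngrams"]
     }
   }

Then point first_name.partial at partial_first_name and
last_name.partial at partial_last_name in the mapping.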


> 2. Names - one of the problems I foresee stems from the fact that I
> want a single input string with all the search parameters (except
> geolocation). The problem is telling the first name apart from the
> last name. An additional complication is that a name (both first and
> last) can consist of several words. My hope was that I could build the
> indexes/queries in such a way that I can throw both names at the query
> as a single string, leaving it to ES to figure out which one is which.

In my example, "rob smith" looks for "rob OR smith", and you're running
that against first name and last name, so you can use the same search
string.  It will find 'rob' in first name and 'smith' in last name.
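For example:

   { bool:
      { should: [
          { text: { "first_name": "rob smith" }},
          { text: { "last_name":  "rob smith" }}
        ]
   }}

Each text query analyzes the string into its tokens and matches any of
them, so the doc with first_name "rob" and last_name "smith" matches
both clauses.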

> 3. Edge ngrams - I would like to limit the wildcard search to, let us
> say, the first 6 chars, assuming that anything longer implies an exact
> match. My clumsy experiments with ES made me think that if ngrams are
> in play, they have to go all the way through the max length of the
> field, otherwise the search misses exact matches. I hope I was doing
> something wrong.

You can limit the ngrams, but there is no real reason to do so.  Also,
you have the full word version of the field which it will match against.
In the bool query I'm using 'should' (which is like 'or') so not all
fields need to match.
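If you did want to cap them, the edge ngram token filter takes min_gram
and max_gram settings, eg (6 chars, per your example):

   "name_ngrams": {
     "type":     "edgeNGram",
     "side":     "front",
     "min_gram": 1,
     "max_gram": 6
   }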

> 4. Member ID - it is alphanumeric. I still think that some degree of
> fuzziness can help here. The idea is to let the user type anything he
> knows and provide autosuggest as he types. So if he knows the ID -
> great, it should be an immediate hit; but if he mistyped it and also
> provided a last name, it can still let me make a pretty well educated
> guess. So why not do it? I think, though, that the edit distance
> allowed here should be minimal.

That's probably fine - with the text query you can use the 'fuzziness'
parameter.
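For example (the exact fuzziness value is illustrative - keep it small,
as you say):

   { text: { member_id: {
       query:     "abc-1234",
       fuzziness: 1
   }}}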

> 5. Geo - I am curious about the performance of this. Does it really
> compute the square root of a sum of squares? Does that mean it
> actually loops through all documents to find the matching ones? The
> reason I am asking is that I do not insist on a circle - I can get
> away with a square: instead of a real geo distance I could use a range
> on both longitude and latitude. Would that be faster?

geo_distance is fast. No idea how it works internally, but no worries
there.

clint


Re: help needed with the query

kimchy
In reply to this post by mfeingold
I suggest you start with the mapping. Begin with a simple one, where you have mappings set for the different elements (first name, last name), with custom analyzers that you define to do what you want. You might need the multi_field mapping type if you want several analyzers applied to the same field.


Re: help needed with the query (partial matching)

lalit mishra
In reply to this post by Clinton Gormley
Hi Clinton,
In place of ngrams, can I use a prefix query to serve the purpose? Is there any advantage to using the ngram tokenizer?

The configuration below is an example:

{
    "tweet" : {
        "properties" : {
            "shortName" : {
                "type" : "multi_field",
                "fields" : {
                    "name" : {"type" : "string", "index" : "analyzed"},
                    "untouched" : {"type" : "string", "index" : "not_analyzed"}
                }
            }
        }
    }
}

Query name.untouched for exact search using textPhrase, and a prefix
query for partial search.

Please let me know if you think otherwise.

Thanks,
Lalit.

Re: help needed with the query (partial matching)

Clinton Gormley
Hi Lalit

> In place of ngrams, can I use a prefix query to serve the purpose? Is
> there any advantage to using the ngram tokenizer?

Performance.  The prefix query is easy to use, but nowhere near as
efficient. First it needs to find all terms which might match, then run
queries on all of those. And you may have too many matching terms,
etc.

So the prefix query is fine for small numbers of terms, but ngrams will
scale.

clint




Re: help needed with the query (partial matching)

lalit mishra
Thanks Clinton for the quick response.

My knowledge of edge ngrams is limited - can you shed some light on
what an edge ngram actually is? I would like to use edge ngrams too.

Thanks,
Lalit.


Re: help needed with the query (partial matching)

Clinton Gormley

>
> My knowledge of edge ngrams is limited - can you shed some light on
> what an edge ngram actually is? I would like to use edge ngrams too.

An ngram is a moving window, so an ngram of length 2 of the word "help"
would give you "he","el","lp"

An edge-ngram is anchored to either the beginning or the end of the
word, eg "h","he","hel","help"  or (from the end) "help","elp","lp","p"
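You can see this with the analyze API - eg, assuming the partial_name
analyzer from earlier in this thread is defined on the 'test' index:

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=help&analyzer=partial_name'

which returns the edge ngram tokens h, he, hel, help.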

clint


Re: help needed with the query (partial matching)

lalit mishra
Cool, thanks :)
