|
|
Can you guys help me understand how I should approach my problem? Here
is what I need to do:
As a user types information about a member I need to provide a list of
suggestions based on the provided information. The information may or
may not include the following: member first/last name, his member id
and date of birth.
It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/ last name. I also would like to use a list of synonyms for the
first name ('Bob' vs 'Robert'). It also makes sense to use edit
distance against member id - to account for typing errors. Date of
birth I would like to use as a filter - if it is there I only show
members matching it.
Also, to further narrow the list, I would only include members who live
within a certain distance of the service location.
All of this is just thinking aloud (in writing?). I am not sure how
this should be (or whether it can be) translated into an ES
configuration/query.
Any help?
|
|
Hi Michael
> As a user types information about a member I need to provide a list of
> suggestions based on the ...: member first/last name, his member id
> and date of birth.
>
> It is my understanding that I need to index my membership info (~100M
> records) so that I can use edit distance and metaphone against first
> name/ last name. I also would like to use a list of synonyms for the
> first name ('Bob' vs 'Robert').
Yes - preparing your data correctly is essential.
> It also makes sense to use edit
> distance against member id - to account for typing errors.
Do you really think this is the case? If somebody is typing a member ID,
then I think they should see JUST the associated user. Otherwise, if
all I have is the member ID, I type that, and it shows me 20 different
users, how do I know which is the one I want? I'll ignore this
requirement.
> Date of birth I would like to use as a filter - if it is there I only
> show members matching it.
> Also, to further narrow the list, I would only include members who live
> within a certain distance of the service location.
I apologise in advance - this email is long, but it is well worth
reading (and I should probably turn it into a tutorial, as this
question is asked often):
OK, so there are two phases here:
1) preparing your data, and
2) searching
PREPARING YOUR DATA:
--------------------
First, let's decide how each field needs to be indexed, then we can look
at what analyzers we need to provide.
- first name / last name:
  - these are string fields
  - we want to use synonyms (eg Robert vs Bob)
    http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html
  - we want to include metaphones
    http://www.elasticsearch.org/guide/reference/index-modules/analysis/phonetic-tokenfilter.html
  - we want to find 'clinton' if the user types 'clin' (auto-complete),
    so we'll use edge ngrams:
    http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenfilter.html
  - and let's also use ascii-folding to allow 'éëè' to match 'e'
    http://www.elasticsearch.org/guide/reference/index-modules/analysis/asciifolding-tokenfilter.html
  - we want to do 3 types of matches:
    - most relevant: full word matches
    - less relevant: partial word matches (eg with ngrams, synonyms)
    - least relevant: metaphone matches
    so we'll index the names with three versions, as a multi-field:
    http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html
- member id
  - you didn't specify if this is numeric or alphanumeric, so I'll
    just assume alphanumeric, possibly with punctuation, eg "ABC-1234"
  - let's say that we want to tokenize this as "abc","1234", so we'll
    use the "simple" analyzer:
    http://www.elasticsearch.org/guide/reference/index-modules/analysis/simple-analyzer.html
- birthday
  - this is just a date field, no analysis needed
- location
  - this is a geo_point field, no analysis needed
So we have a list of custom analyzers we need to define:
- full_name:
- standard token filter
- lowercase
- ascii folding
- partial_name:
- standard token filter
- lowercase
- ascii folding
- synonyms
- edge ngrams
- name_metaphone:
- standard token filter
- phonetic/metaphone filter
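As a rough sketch, the analysis settings for those three analyzers might
look something like this - the synonym pairs, ngram sizes and filter
names are just illustrative, and the full command follows in the gist
below:
curl -XPUT 'http://127.0.0.1:9200/test' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "name_synonyms": {
          "type": "synonym",
          "synonyms": [ "bob, robert", "bill, william" ]
        },
        "name_ngrams": {
          "type": "edgeNGram",
          "side": "front",
          "min_gram": 1,
          "max_gram": 20
        },
        "metaphone_filter": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": true
        }
      },
      "analyzer": {
        "full_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase", "asciifolding" ]
        },
        "partial_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase", "asciifolding",
                      "name_synonyms", "name_ngrams" ]
        },
        "name_metaphone": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "metaphone_filter" ]
        }
      }
    }
  }
}'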
Here is the command to create the index with the above analyzers and
mapping: https://gist.github.com/1088986
It is quite long, but just goes through the process listed above. If you
look at each block, it's actually quite simple.
NOTES:
1) For first_name/last_name, I am using multi-fields. The "main"
sub-field has the same name as the top level, so that if I refer
to "first_name" it automatically references "first_name.first_name"
So in effect, I have "first_name" and "first_name.partial" and
"first_name.metaphone"
2) In the partial name fields, I am using index_analyzer and
search_analyzer.
Normally, you want your data and search terms to use the same
analyzer - this ensures that you are searching for the same
terms that are actually stored in ES. For example, in the
first_name.metaphone field, I just specify an 'analyzer'
(which sets both the search_analyzer and index_analyzer to
the same value)
However, for the partial field, we want them to be different. If we
store the name "Clinton", we want to be able to use auto-complete
for search terms like 'clin' (ie partial matches). So at index time,
we tokenize clinton as c,cl,cli,clin,clint,clinto,clinton
However, when we search, we don't want 'clin' to match 'cat','cliff'
etc. So we DON'T want to use the ngram tokenizer on search terms.
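To make notes 1) and 2) concrete, here is a condensed sketch of how the
first_name mapping might look (the gist has the full version;
last_name would follow the same pattern):
"first_name": {
  "type": "multi_field",
  "fields": {
    "first_name": {
      "type": "string",
      "analyzer": "full_name"
    },
    "partial": {
      "type": "string",
      "index_analyzer": "partial_name",
      "search_analyzer": "full_name"
    },
    "metaphone": {
      "type": "string",
      "analyzer": "name_metaphone"
    }
  }
}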
So, run the commands in the gist above, and then you can experiment with
searches.
You can see what tokens each analyzer produces with the analyze API.
Try these queries:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=full_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=partial_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=name_metaphone'
and to check that the ascii folding is working, try 'sánchez' (but URL encoded):
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=s%C3%A1nchez&analyzer=partial_name'
SEARCHING YOUR DATA
-------------------
Let's get rid of the easy stuff first:
birthday:
If your user enters a birthday, then you want to filter the
results to only include members with a matching birthday:
{ term: { birthday: '1970-10-24' }}
location:
Use a geo_distance filter to find results within 100km of
London:
{ geo_distance: {
distance: "100km",
location: [51.50853, -0.12574]
}}
member_id:
This can use a simple text query:
{ text: { member_id: "abc-1234" }}
OK - now the more interesting stuff: first name and last name.
The logic we want to use here is:
Show me any name whose first or last name field matches
completely or partially, but consider full word matches
to be more relevant than partial matches or metaphone
matches
We're going to combine these queries using the 'bool' query. The
difference between the 'bool' query and the 'dismax' query is that the
'bool' query combines the _score/relevance of each matching clause,
while the 'dismax' query chooses the highest _score from the matching
clauses.
{ bool:
  { should: [
      { text: { "first_name": "rob" }},           # full name
      { text: { "first_name.partial": "rob" }},   # partial match
      { text: { "first_name.metaphone": "rob" }}  # metaphone
  ]
}}
This will find all docs that match any of the above clauses.
The _score of each matching clause is combined, so a doc which matches
all 3 clauses will rank higher than a doc that matches just one clause,
so we already have some ranking here.
But let's say that we wanted a full word match to be significantly more
relevant than the other two. We can change that clause to:
{ text: { "first_name": {
    query: "rob",
    boost: 2
}}}
Of course, we need to include the same 3 clauses for "last_name" as
well.
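Putting all six clauses together, the name part of the query might look
something like this (the boost values are illustrative):
{ bool:
  { should: [
      { text: { "first_name":           { query: "rob smith", boost: 2 }}},
      { text: { "first_name.partial":   "rob smith" }},
      { text: { "first_name.metaphone": "rob smith" }},
      { text: { "last_name":            { query: "rob smith", boost: 2 }}},
      { text: { "last_name.partial":    "rob smith" }},
      { text: { "last_name.metaphone":  "rob smith" }}
  ]
}}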
Now, to turn these into a query that we can pass to the Search API:
All search queries must be wrapped in a top-level { query: {....}}
element, which will contain one of 3 possibilities:
1) just the bool query
   { query: { bool: {...} }}
   https://gist.github.com/1089180
2) just one or more filters
   { query: { constant_score: { filter: {....} }}}
   https://gist.github.com/1089206
3) the bool query combined with one or more filters
   { query: { filtered: { query: { bool: ... }, filter: {.....} }}}
   https://gist.github.com/1089201
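For instance, a sketch of what possibility 3 might look like end to
end, combining the name clauses (just first_name shown, for brevity)
with the birthday and location filters via an 'and' filter - the
index/type names and values are illustrative:
curl -XGET 'http://127.0.0.1:9200/test/person/_search?pretty=1' -d '{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            { "text": { "first_name": "rob" }},
            { "text": { "first_name.partial": "rob" }},
            { "text": { "first_name.metaphone": "rob" }}
          ]
        }
      },
      "filter": {
        "and": [
          { "term": { "birthday": "1970-10-24" }},
          { "geo_distance": {
              "distance": "100km",
              "location": [51.50853, -0.12574]
          }}
        ]
      }
    }
  }
}'
This was long, but I hope it was worth it.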
If anything isn't clear, please ask, and I can improve this and turn it
into a tutorial.
clint
|
|
Hi Clinton
Thanks for the quick and detailed response.
To clarify a few points:
1. The synonyms - I only need the synonyms on the first name, I
think. I do not imagine them being of much use for last names. I am
not sure whether dropping synonyms for the last name would have any
impact on performance or disk space. Based on your templates I hope I
understand how to do this.
2. Names - one of the problems I foresee stems from the fact that I
want a single input string with all the search parameters (except
geoloc). The problem is telling apart the first name from the last
name. An additional complication is that a name (both first and last)
can consist of several words. My hope was that I could build the
indexes/queries in such a way that I can throw both names at the
query as a single string, leaving it to ES to figure out which one is
which.
3. Edge ngrams - I would like to limit the wildcard search to the
first, let us say, 6 chars, assuming that anything longer implies an
exact match. My clumsy experiments with ES made me think that if
ngrams are in play, they have to run all the way through the max
length of the field, otherwise the search misses exact matches. I was
doing something wrong, I hope.
4. Member ID - it is alphanumeric. I still think that some degree of
fuzziness can help here. The idea is to let the user type anything he
knows and provide autosuggest as he types. So if he knows the ID -
great, it should be an immediate hit, but if he mistyped it and also
provided a last name - it still lets me make a pretty well educated
guess. So why not do it? I think, though, that the edit distance
allowed here should be minimal.
5. Geo - I am curious about the performance of this. Does it really
take the square root of a sum of squares? Does that mean that during
this process it actually loops through all documents to find the
matching ones? The reason I am asking is that I do not insist on a
circle - I can get away with a square; that is, instead of real geo
distance I can use a range on both longitude and latitude. Would
that be faster?
|
|
Hi Michael
> 1. The synonyms - I only need the synonyms on the first name, I
> think. I do not imagine them being of much use for last names. I am
> not sure whether dropping synonyms for the last name would have any
> impact on performance or disk space. Based on your templates I hope I
> understand how to do this.
That's fine, just define one custom analyzer with synonyms, and use that
for first names, and another without synonyms, for the last name.
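For example, as a settings fragment (the analyzer and filter names are
just suggestions, following the earlier sketch):
"partial_first_name": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [ "standard", "lowercase", "asciifolding",
              "name_synonyms", "name_ngrams" ]
},
"partial_last_name": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [ "standard", "lowercase", "asciifolding",
              "name_ngrams" ]
}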
> 2. Names - one of the problems I foresee stems from the fact that I
> want a single input string with all the search parameters (except
> geoloc). The problem is telling apart the first name from the last
> name. An additional complication is that a name (both first and last)
> can consist of several words. My hope was that I could build the
> indexes/queries in such a way that I can throw both names at the
> query as a single string, leaving it to ES to figure out which one is
> which.
In my example, "rob smith" looks for "rob OR smith", and you're running
that against first name and last name, so you can use the same search
string. It will find 'rob' in first name and 'smith' in last name.
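In other words, something like this, with the whole input string passed
to each clause (the text query analyzes the string and ORs the
resulting terms by default):
{ bool:
  { should: [
      { text: { "first_name": "rob smith" }},
      { text: { "last_name":  "rob smith" }}
  ]
}}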
> 3. Edge ngrams - I would like to limit the wildcard search to the
> first, let us say, 6 chars, assuming that anything longer implies an
> exact match. My clumsy experiments with ES made me think that if
> ngrams are in play, they have to run all the way through the max
> length of the field, otherwise the search misses exact matches. I was
> doing something wrong, I hope.
You can limit the ngrams, but there is no real reason to do so. Also,
you have the full word version of the field which it will match against.
In the bool query I'm using 'should' (which is like 'or') so not all
fields need to match.
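If you do decide to cap them, the edge ngram filter takes
min_gram/max_gram settings, eg:
"name_ngrams": {
  "type": "edgeNGram",
  "side": "front",
  "min_gram": 1,
  "max_gram": 6
}
A name longer than 6 characters then only produces ngrams up to 6
characters, but exact matches are still caught by the full-word
clause.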
> 4. Member ID - it is alphanumeric. I still think that some degree of
> fuzziness can help here. The idea is to let the user type anything he
> knows and provide autosuggest as he types. So if he knows the ID -
> great, it should be an immediate hit, but if he mistyped it and also
> provided a last name - it still lets me make a pretty well educated
> guess. So why not do it? I think, though, that the edit distance
> allowed here should be minimal.
That's probably fine - with the text query you can use the 'fuzziness'
parameter.
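eg something like this (the fuzziness value is illustrative; for
string fields it behaves like the fuzzy query's minimum similarity):
{ text: { member_id: {
    query:     "abc-1234",
    fuzziness: 0.7
}}}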
> 5. Geo - I am curious about the performance of this. Does it really
> take the square root of a sum of squares? Does that mean that during
> this process it actually loops through all documents to find the
> matching ones? The reason I am asking is that I do not insist on a
> circle - I can get away with a square; that is, instead of real geo
> distance I can use a range on both longitude and latitude. Would
> that be faster?
geo_distance is fast. No idea how it works internally, but no worries
there.
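That said, if you do want to try the square, there is a
geo_bounding_box filter you could experiment with, eg (corner values
illustrative):
{ geo_bounding_box: {
    location: {
      top_left:     { lat: 52.0, lon: -1.0 },
      bottom_right: { lat: 51.0, lon:  1.0 }
    }
}}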
clint
|
Administrator
|
I suggest you start with the mapping. Start with a simple one, where you have mappings set for the different elements (first name, last name), with custom analyzers that you define to do what you want. You might need to use the multi_field mapping type if you want several analyzers applied to the same field.
|
|
Hi Clinton,
In place of ngrams, can I use a prefix query to serve the same purpose?
Is there any advantage to using the ngram tokenizer?
The configuration below is an example:
{ "tweet" : {
    "properties" : {
      "shortName" : {
        "type" : "multi_field",
        "fields" : {
          "name" : { "type" : "string", "index" : "analyzed" },
          "untouched" : { "type" : "string", "index" : "not_analyzed" }
        }
      }
    }
}}
I would query shortName.untouched for exact search (using textPhrase)
and use a prefix query for partial search.
Please let me know if you think otherwise.
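For reference, a prefix query against the untouched field might look
like this (note that the prefix query is not analyzed, so the prefix
has to match the not_analyzed value exactly, including case):
{ query: { prefix: { "shortName.untouched": "Rob" }}}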
Thanks,
Lalit.
|
|
Hi Lalit
> In place of ngram can I use prefixQuery to serve the purpose is there
> any advantage of using ngram tokenizer?
Performance. The prefix query is easy to use, but nowhere near as
efficient. First it needs to find all terms which might match, then run
queries on all of those. And you may have too many matching terms,
etc.
So the prefix query is fine for small numbers of terms, but ngrams will
scale.
clint
|
|
Thanks, Clinton, for the quick response.
My knowledge of edge ngrams is limited - can you shed some light on
what an edge ngram actually is? I would like to use edge ngrams too.
Thanks,
Lalit.
|
|
> My knowledge of edge ngrams is limited - can you shed some light on
> what an edge ngram actually is? I would like to use edge ngrams too.
An ngram is a moving window, so an ngram of length 2 of the word "help"
would give you "he","el","lp"
An edge-ngram is anchored to either the beginning or the end of the
word, eg "h","he","hel","help" or (from the end) "help","elp","lp","p"
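For example, a front-anchored edge ngram token filter might be
configured like this (a settings fragment; the name and sizes are just
illustrative), which would turn "help" into h, he, hel, help:
"my_edge_ngrams": {
  "type": "edgeNGram",
  "side": "front",
  "min_gram": 1,
  "max_gram": 10
}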
clint
|
|
Cool, thanks :)
|
|