Need help on similarity ranking approach

Need help on similarity ranking approach

Rgs
Hi,

I'm new to Elasticsearch and want to use it for similarity ranking.

Requirement.

I need to index documents that have two free-text fields (field1 and field2). Whenever a new document arrives and is indexed, I need to find out how similar it is to the existing documents based on one field (say field1), and these similarities should be recorded. If the similarity reaches some X%, an action should be triggered. These steps should be carried out for every document that gets indexed.

My approach

1. Whenever a new document arrives, index it first.
2. Once it is indexed, match the document against the existing documents using field1.
3. In the search results, read the score field as the similarity percentage and record it.
4. Find the results whose score reaches X%, then perform the required action.


Could you please tell me whether this approach is sound, or whether there is a better way to do similarity ranking in a case like this?

Thanks
Rgs

Re: Need help on similarity ranking approach

Rgs
Could you guys please help on this?

Re: Need help on similarity ranking approach

Binh Ly-2
I'm not sure you can use the score to determine % similarity. What you certainly can do, for each new incoming document, is run a more like this query against your index (and specify a bunch of parameters like percent_terms_to_match) to perhaps achieve something closer to what you want.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html#query-dsl-mlt-query
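
As a rough illustration of the suggestion above, here is a minimal Python sketch that only builds the request body for a more like this query on field1. The parameter names (like_text, percent_terms_to_match, min_term_freq, min_doc_freq) follow the query DSL as I remember it from Elasticsearch releases of that period; the helper name build_mlt_body and the chosen values are assumptions for illustration, and newer releases may use different parameter names, so check the documentation for your version.

```python
import json


def build_mlt_body(text, field="field1", percent_terms_to_match=0.3):
    """Build a search body with a more_like_this query on a single field.

    Hypothetical helper: the values below are illustrative defaults,
    not recommendations.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like_text": text,
                # Fraction of the selected terms that must match.
                "percent_terms_to_match": percent_terms_to_match,
                # Keep rare terms so short documents still produce a query.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }


body = build_mlt_body("free text of the incoming document")
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the body of a normal search request against the index.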

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/97e2f5bf-1c95-4775-a894-74650cccde12%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Need help on similarity ranking approach

Ivan Brusic
In reply to this post by Rgs
The problem with your approach is that Lucene does not provide a score that expresses how similar a document is to a query. The score is based on the (default) TF-IDF algorithm and is not an absolute measure. You can score a document against all others, and the scores will be comparable with each other for that one document, but the range of scores can vary greatly from one query document to the next.

For example, the range of scores of one document against all others might be 0.5 - 30. The range of scores for another document against the same documents might be 1.2 - 24. It would be difficult to establish an overall threshold. You can, of course, always find the top % of documents.
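
To make the point concrete, here is a small Python sketch using made-up score lists that span the two ranges mentioned above. The scores, the threshold of 25 and the function names are all invented for illustration; nothing here comes from a real index.

```python
def above_threshold(scores, threshold):
    """Keep the scores that reach a fixed absolute threshold."""
    return [s for s in scores if s >= threshold]


def top_fraction(scores, fraction):
    """Keep the highest-scoring fraction of the results (at least one)."""
    ranked = sorted(scores, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Illustrative scores of two query documents against the same four documents.
scores_doc_a = [30.0, 12.0, 4.0, 0.5]
scores_doc_b = [24.0, 10.0, 3.0, 1.2]

# A fixed threshold treats the two queries very differently...
print(above_threshold(scores_doc_a, 25.0))  # [30.0]
print(above_threshold(scores_doc_b, 25.0))  # []

# ...whereas a relative cut-off is always well defined per query.
print(top_fraction(scores_doc_a, 0.25))  # [30.0]
print(top_fraction(scores_doc_b, 0.25))  # [24.0]
```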

The other issue is that the similarity will change as you index more documents. If you only have one document in your index, the similarity score for the next document will be different than if you scored it against an index with millions of documents, because of the IDF values.
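
A quick way to see this effect is to compute the IDF factor for different index sizes. The sketch below assumes the formula idf = 1 + ln(numDocs / (docFreq + 1)), which is how I recall Lucene's DefaultSimilarity of that period; other Lucene versions use slightly different variants, so treat the formula and the document counts as assumptions.

```python
import math


def classic_idf(num_docs, doc_freq):
    """IDF as assumed for Lucene's DefaultSimilarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# The same term weighted in a tiny index and in a large one.
idf_small = classic_idf(num_docs=1, doc_freq=1)
idf_large = classic_idf(num_docs=1_000_000, doc_freq=10)

print(f"1 document in the index:          idf = {idf_small:.3f}")
print(f"1,000,000 documents in the index: idf = {idf_large:.3f}")
```

Because this factor enters every term's contribution to the score, the score of the same pair of documents drifts as the index grows.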

Even if your range of scores is comparable between documents, there is nothing in Elasticsearch to help you with this task. The better question is: why do you need to calculate relevancy between documents rather than simply rank documents according to a query?

-- 
Ivan




On Mon, Apr 28, 2014 at 12:34 AM, Rgs <[hidden email]> wrote:
Could you guys please help on this?



--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Need-help-on-similarity-ranking-approach-tp4054847p4054889.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.


Re: Need help on similarity ranking approach

Rgs
Thanks Binh Ly and Ivan Brusic for your replies.

I need to find the similarity, as a percentage, of a document against the other documents, and this will be used for grouping the documents.

Is it possible to get the similarity percentage using the more like this query? Or is there any other way to calculate the percentage of similarity from the query result?

Eg:  document1 is 90% similar to document2.
      document1 is 45% similar to document3
      etc..

Thanks

Re: Need help on similarity ranking approach

Alex Ksikes
Hello,

What you want to know is the score the document gets when it matches itself using more like this. The API excludes the queried document. However, it is equivalent to running a boolean query made of one more like this field query per field of the queried document. That query will give you, as its top result, the document matching itself, so you can express the scores of the remaining matched documents as a percentage of that top score.
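
The normalisation described above can be sketched in a few lines of Python. The hit list is made up (chosen so the output mirrors the 90% / 45% example earlier in the thread) and percent_of_self is a hypothetical helper name; in practice the (id, score) pairs would come from the search response. Keep in mind the earlier caveat that such a ratio is relative to one query document, not an absolute similarity.

```python
def percent_of_self(hits, query_id):
    """Express each hit's score as a percentage of the self-match score.

    `hits` is a list of (doc_id, score) pairs that includes the queried
    document itself.
    """
    self_score = next(score for doc_id, score in hits if doc_id == query_id)
    return {
        doc_id: round(100.0 * score / self_score, 1)
        for doc_id, score in hits
        if doc_id != query_id
    }


# Illustrative scores only.
hits = [("doc1", 8.0), ("doc2", 7.2), ("doc3", 3.6)]
print(percent_of_self(hits, "doc1"))  # {'doc2': 90.0, 'doc3': 45.0}
```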

Alex

On Friday, May 2, 2014 3:22:34 PM UTC+2, Rgs wrote:
Thanks Binh Ly and Ivan Brusic for your replies.

I need to find the similarity in percentage of a document against other
documents and this will be considered for grouping the documents.

is it possible to get the similarity percentage using more like this query?
or is any other way to calculate the percentage of similarity from the query
result?

Eg:  document1 is 90% similar to document2.
      document1 is 45% similar to document3
      etc..

Thanks




Re: Need help on similarity ranking approach

Rgs
hi,

What I have done now is create a custom similarity class and a custom similarity provider class, which extend DefaultSimilarity and AbstractSimilarityProvider respectively, and override the idf() method to return 1.

Now I'm getting values like 1, 0.987, 0.876 etc., and I interpret them as 100%, 98%, 87% etc.

Can you please confirm whether this approach can be used for finding the percentage of similarity?

sorry for the late reply.

Thanks
Rgs

Re: Need help on similarity ranking approach

Alex Ksikes
Hello,

I am not sure that would work. I'd first index your document, and then use mlt with this document's id and include set to true (added in the latest ES release). Then you'll know how "far" your documents are from the queried document. Also, make sure to pick up most of the terms by setting percent_terms_to_match=0, max_query_terms to a high value and min_doc_freq=1. To see which terms from the queried document matched in the response, you can use explain.
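
For illustration, a request body with those settings might be assembled as in the Python sketch below. The names include, percent_terms_to_match, max_query_terms and min_doc_freq are taken from the message above; passing the document through an "ids" list, the value 1000 for "a high value" and the helper name are my assumptions and should be checked against the documentation of the release you run.

```python
import json


def build_self_mlt_body(doc_id, field="field1"):
    """Build a more_like_this search body that keeps the queried document.

    Hypothetical helper; "ids" and the value 1000 are assumptions.
    """
    return {
        # Ask for a per-hit scoring explanation to see which terms matched.
        "explain": True,
        "query": {
            "more_like_this": {
                "fields": [field],
                "ids": [doc_id],
                # Keep the queried document in the results (self match).
                "include": True,
                # Settings suggested above to pick up most of the terms.
                "percent_terms_to_match": 0,
                "max_query_terms": 1000,
                "min_doc_freq": 1,
            }
        },
    }


print(json.dumps(build_self_mlt_body("42"), indent=2))
```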

Alex

On Thursday, May 29, 2014 10:42:47 AM UTC+2, Rgs wrote:
hi,

What i did now is, i have created a custom similarity & similarity provider
class which extends DefaultSimilarity and AbstractSimilarityProvider classes
respectively and overridden the idf() method to return 1.

Now I'm getting some percentage values like 1, 0.987, 0.876 etc and
interpret it as 100%, 98%, 87% etc.

Can you please confirm whether this approach can be taken for finding the
percentage of similarity?

sorry for the late reply.

Thanks
Rgs




Re: Need help on similarity ranking approach

Alex Ksikes
In reply to this post by Rgs
Also this plugin could provide a solution to your problem:

http://yannbrrd.github.io/
