Precise similarity-based scoring?



I'm trying to reproduce an algorithm for detecting inter-document similarity in a large collection of documents. It already works on another infrastructure, and we hope to benefit from ElasticSearch's speed compared with our own sloppy index.
Each incoming document is split into 4-grams (shingles), discarding all words shorter than 4 characters along the way. In our own version of the algorithm, this produces patterns unique enough to match one by one. The final score from document A to document B is the number of shingles matching in A and B divided by the total number of shingles. The precision of the algorithm is good enough for us at the moment.
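
For reference, here is a minimal Python sketch of the scoring described above. It is not our production code. It assumes that shingles are word 4-grams, that text is lowercased and split on non-word characters, that shingles are compared as sets, and that "total number of shingles" means the distinct shingles of document A. The names `shingles` and `similarity` are made up for this illustration.

```python
import re


def shingles(text, n=4, min_word_len=4):
    """Return the set of word n-grams, dropping words shorter than min_word_len."""
    # Assumption: lowercase and split on non-word characters.
    words = [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_word_len]
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(doc_a, doc_b):
    """Share of A's distinct shingles that also occur in B, in the range 0.0 to 1.0."""
    a = shingles(doc_a)
    b = shingles(doc_b)
    if not a:
        return 0.0  # A has no shingles, so nothing can match
    return len(a & b) / len(a)
```

With this definition a document always scores 1.0 against itself, which is the absolute scale we want to get out of ES.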

Is there any way to reproduce this scheme in ES? 

We've tried doing so using:
- The More Like This query
- Splitting document A into shingles, then building a query like:

  {
    "query": {
      "bool": {
        "should": [
          { "match": { "content": "shingle1" } },
          { "match": { "content": "shingle2" } }
        ]
      }
    }
  }

etc., with all the shingles.
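
A rough sketch of how such a query body could be assembled in Python, one `should` clause per shingle. The helper name `build_shingle_query` and the default field name `content` are illustrative, and the result is just a dictionary that would still need to be sent to the search endpoint.

```python
import json


def build_shingle_query(shingle_list, field="content"):
    """Build a bool/should query body with one match clause per shingle."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: s}} for s in shingle_list]
            }
        }
    }


if __name__ == "__main__":
    body = build_shingle_query(["shingle1", "shingle2"])
    print(json.dumps(body, indent=2))
```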

While the results are quite similar to those from our algorithm, there is no way to map the score onto an absolute scale (for example from 1 to 100, with the score meaning the same thing across all documents in the set). The closest candidate we have found is to locate document A itself in the results, treat its match score as 100%, and recalculate all other scores relative to it.
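
The rescaling workaround looks roughly like the sketch below. It assumes the search results have already been reduced to `(doc_id, raw_score)` pairs and that document A is itself among them. The function name and input shape are hypothetical.

```python
def normalize_scores(hits, source_id):
    """Rescale raw scores so that the source document's own score becomes 100."""
    raw = dict(hits)
    self_score = raw.get(source_id)
    if not self_score:
        # Without a non-zero self-match score there is no reference point.
        raise ValueError("source document not found in hits or has zero score")
    return {doc_id: 100.0 * score / self_score for doc_id, score in raw.items()}
```

For example, `normalize_scores([("A", 8.0), ("B", 4.0)], "A")` gives `{"A": 100.0, "B": 50.0}`. The numbers are relative to A's raw score only, so they are not comparable between different source documents.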

However, the current similarity scheme cannot really be mapped back onto our scale. Which direction should we look in: tweaking some scoring parameters, or going straight to writing our own similarity plugin?
