I'm trying to reproduce an algorithm (already working on another infrastructure) for detecting inter-document similarity
in a large collection of documents, hoping to benefit from ElasticSearch's speed compared with our own sloppy index.
Each incoming document is split into 4-grams (shingles), and all words shorter than 4 characters are thrown away along the way. In our own
version of the algorithm, this creates patterns unique enough to be matched one by one. The final score from document A to document B is
the number of shingles that match in A and B divided by the total number of shingles. The precision of the algorithm is good enough for us at the moment.
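
To make the scheme concrete, here is a minimal Python sketch of the scoring as described above. It is not our production code, and several details are assumptions made for the example: shingles are taken to be word-level 4-grams, text is lowercased, duplicate shingles are counted once (a set), and "total number of shingles" is read as the number of shingles in A. The function names `shingles` and `similarity` are made up for illustration.

```python
import re


def shingles(text, n=4, min_len=4):
    """Return the set of word n-grams, built after dropping short words."""
    # Assumption: words are runs of word characters, compared case-insensitively.
    words = [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_len]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a, b, n=4, min_len=4):
    """Share of A's shingles that also occur in B, between 0.0 and 1.0."""
    sa, sb = shingles(a, n, min_len), shingles(b, n, min_len)
    if not sa:
        # A is too short to produce a single shingle.
        return 0.0
    # Assumption: the denominator is the number of distinct shingles in A.
    return len(sa & sb) / len(sa)
```

Multiplying the result by 100 gives the kind of absolute percentage we would like to get back from ES.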
Is there any way to reproduce this scheme in ES?
We've tried doing so using:
- More like this
- Split document A into shingles, then create a query like
etc., with all the shingles.
While the results we get are quite similar to those from our algorithm, there's no way to map the score onto an absolute scale (say, from 1 to 100, where a score means the same thing across all documents in the set). The closest candidate to what we're looking for is to find document A itself among the hits, treat its match score as 100%, and then recalculate all other scores relative to it.
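
The rescaling workaround just described can be sketched roughly as follows. This is only an illustration: `normalize_hits` is a made-up helper, and `hits` is assumed to be a list of `(doc_id, score)` pairs pulled from the `_id` and `_score` fields of a search response that includes document A itself.

```python
def normalize_hits(hits, source_id):
    """Rescale raw scores so that the source document's own hit counts as 100."""
    scores = dict(hits)
    self_score = scores.get(source_id)
    if not self_score:
        # Without A's own (non-zero) score there is nothing to scale against.
        raise ValueError("source document missing from hits, or its score is zero")
    return {
        doc_id: 100.0 * score / self_score
        for doc_id, score in scores.items()
        if doc_id != source_id
    }
```

With raw scores of 8.0 for A and 4.0 for B, B comes out at 50.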
However, the current similarity scheme is not really reverse-mappable onto our scale. Which direction should we look in: hacking some scoring parameters, or going straight to writing our own similarity plugin?