Decay score based on number of occurrences

Decay score based on number of occurrences

tsturzl
I'm trying to find a way to prevent multiple posts from the same author from appearing in search results. So far I've tried random scoring, which lets me maintain pagination; however, I can still end up with up to 4 posts from the same author in a given page of 10 results.
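
For reference, the random scoring I'm using is roughly the following (a minimal sketch; the field names, query text, and seed value are only illustrative of my setup):

    {
      "from": 0,
      "size": 10,
      "query": {
        "function_score": {
          "query": { "match": { "body": "some search terms" } },
          "random_score": { "seed": 42 },
          "boost_mode": "replace"
        }
      }
    }

With a fixed seed the ordering stays stable across pages, but it is completely independent of the author field, which is why a single page can still repeat an author.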

Is there any way to score a document based on how many times a certain field value occurs in the result set? As far as I'm aware, you cannot persist a variable or object across documents in a scoring script.

I've looked into several ways of accomplishing this, but most of them have significant drawbacks. One is removing the duplicates and querying again for a new set of results with the current authors excluded; however, that second set can itself contain repeated authors, so I'm left querying one by one to replace each duplicate author in the result set, and that breaks deep pagination because the result set used for replacements eventually runs out of pages before the standard search does. I've also tried aggregations, which are not page-able.
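
The "exclude the current authors and re-query" variant I tried looks roughly like this (a sketch; the author values are placeholders):

    {
      "query": {
        "filtered": {
          "query": { "match": { "body": "some search terms" } },
          "filter": {
            "bool": {
              "must_not": { "terms": { "author": ["author_a", "author_b"] } }
            }
          }
        }
      }
    }

Each follow-up call excludes the authors already shown, which is where the extra round trips and the deep-pagination breakage come from.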

Is there any functionality to spread out or reduce the score of a document based on how many times a document with the same author (or other field value) occurs?

Re: Decay score based on number of occurrences

Doug Turnbull
Isn't the top_hits aggregation pageable? See the "from" parameter in the top_hits documentation.

Certainly you don't want to page through everything that way (you'd want scan/scroll for that), but it provides adequate paging for most search uses.

Or do you just want to eliminate duplicate authors within a single page (i.e., a set of 10) of results?
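
Something along these lines is what I have in mind (field names are placeholders, and I haven't tested this exact request):

    {
      "size": 0,
      "query": { "match": { "body": "some search terms" } },
      "aggs": {
        "by_author": {
          "terms": { "field": "author", "size": 10 },
          "aggs": {
            "posts": {
              "top_hits": { "from": 0, "size": 1 }
            }
          }
        }
      }
    }

Each author becomes a bucket, top_hits returns the best post per bucket, and bumping "from" inside top_hits pages through the hits within each bucket.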

-Doug



--
Doug Turnbull
Search & Big Data Architect

Re: Decay score based on number of occurrences

tsturzl
Doug,

First of all, thanks for the reply. I was under the impression that aggregations were not page-able, as everything I've read suggested as much, but I could be wrong. In any case, our marketing team would like our posts to rotate, much as random scoring provides, so that a user sees different posts each session.

The problem I noticed with pagination in aggregations is that from and size apply to the hits within each bucket, while the number of buckets is completely variable, so paging skips hits from every bucket at once. I have about 90 authors, which means 90 buckets with 1 result each. I can limit the number of buckets, but I cannot set a "from" value on the buckets themselves; I can only define the maximum number of buckets.
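
Concretely, what I tried looks something like this (field names illustrative):

    {
      "size": 0,
      "aggs": {
        "by_author": {
          "terms": { "field": "author", "size": 90 },
          "aggs": {
            "posts": {
              "top_hits": { "from": 1, "size": 1 }
            }
          }
        }
      }
    }

Here "from": 1 skips a hit inside every one of the ~90 buckets at once, and the terms "size" only caps how many buckets come back; there is nothing like a "from" that skips the first N buckets.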

For that reason I'm a little lost as to how to paginate aggregations. Also, I'm only trying to ensure there are no repeated authors within a page, not across the entire result set. Deep pagination doesn't have to work, but I'd also like to avoid performing more than one query per search/page, whereas the only solution I've come up with is calling one by one to replace the duplicates, which can mean up to 11 calls. On top of that, some result sets are only 2-3 pages long, so this approach may break pagination for small result sets as well.

I'm just having a very difficult time getting my head around this. Elasticsearch itself doesn't seem to have any feature that can produce this desired outcome.

Re: Decay score based on number of occurrences

Mark Harwood
I have work underway in Lucene and Elasticsearch for a "diversified" form of results collection: https://github.com/elasticsearch/elasticsearch/pull/8191
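
The rough shape of a request using it would be something like the sketch below; the aggregation name and parameters are still settling in the PR, so treat them as tentative rather than a final API:

    {
      "size": 0,
      "query": { "match": { "body": "some search terms" } },
      "aggs": {
        "sample": {
          "diversified_sampler": {
            "field": "author",
            "max_docs_per_value": 1
          },
          "aggs": {
            "diverse_posts": {
              "top_hits": { "size": 10 }
            }
          }
        }
      }
    }

The sampler de-duplicates on the chosen field before child aggregations such as top_hits see the documents, so the returned hits contain at most max_docs_per_value posts per author.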
