How to plug in alternative Rescorer?

Otis Gospodnetic
Hi,

How does one plug in a custom Rescorer into ElasticSearch?
This is from Simon's writeup on query rescorer:

"
Currently the rescore API has only one implementation (the `query` rescorer), which modifies the result set in-place. Future developments could include dedicated rescore results if needed by the implementation, i.e. a pair-wise reranker.
"

Sounds like alternative implementations should be pluggable, and it does look like there are a number of abstract classes and interfaces to allow alternative implementations.  I am just not sure if there is a standard way to tell ES about my alternative rescorer... is there?

Thanks,
Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html



--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: How to plug in alternative Rescorer?

simonw-2


Not yet. Do you have any alternative in mind? Can you share your thoughts on this?

simon 

Re: How to plug in alternative Rescorer?

Otis Gospodnetic
Hi Simon,

The idea is to use the Rescorer step as a way to modify matching documents, by taking advantage of the TopDocs collection in the rescore method.

We would like to answer two types of queries:
e.g. 1: there is a field called "value", and we want to return only the document with the maximum "value" per document "group", the group being another field.
e.g. 2: create a field synthetically, like ScriptField does, with a value calculated by looking at all the documents in TopDocs.

We thought it made sense to do this as a Rescorer extension, because it provides a way to summarize and aggregate information in each shard, which means less overhead over the wire. If you can think of a cleaner way to hook in these ideas, please let me know, even if it doesn't involve Rescorer.
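
To make the two use cases concrete, here is a minimal Python sketch of what such a per-shard step might compute. It is not Elasticsearch code: `Hit`, `max_per_group`, and `add_share_of_total` are hypothetical names, and the list of hits only stands in for the TopDocs window a rescorer would see on one shard.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical stand-in for one entry in a shard's TopDocs window."""
    doc_id: int
    score: float
    fields: dict = field(default_factory=dict)


def max_per_group(top_docs, group_field="group", value_field="value"):
    """Use case 1: keep only the highest-valued hit for each group."""
    best = {}
    for hit in top_docs:
        key = hit.fields[group_field]
        if key not in best or hit.fields[value_field] > best[key].fields[value_field]:
            best[key] = hit
    # Preserve the original ranking order of the surviving hits.
    return [h for h in top_docs if best[h.fields[group_field]] is h]


def add_share_of_total(top_docs, value_field="value", out_field="share"):
    """Use case 2: attach a synthetic field computed from the whole window."""
    total = sum(h.fields[value_field] for h in top_docs)
    for h in top_docs:
        h.fields[out_field] = h.fields[value_field] / total if total else 0.0
    return top_docs


hits = [
    Hit(1, 3.0, {"group": "a", "value": 10}),
    Hit(2, 2.5, {"group": "b", "value": 7}),
    Hit(3, 2.0, {"group": "a", "value": 25}),
    Hit(4, 1.0, {"group": "b", "value": 5}),
]
kept = add_share_of_total(max_per_group(hits))
print([h.doc_id for h in kept])  # [2, 3]
```

Both functions need the whole window at once, which is why a per-document hook such as a script field is not enough for them.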

Thanks,
Otis
Re: How to plug in alternative Rescorer?

Simeon Simeonov
In reply to this post by simonw-2
This is a really cool idea. I can see so many uses for this--some in development and some in production--including TDD/debugging/reporting/analytics, before we even get to manipulating the returned results.

The pattern here is no different--in the abstract sense--than that of a stored procedure. Before we say that search engines and databases are different, let's focus on the fact that they both provide high-performance, runtime-customizable data services. The same patterns of data generation & use repeat themselves regardless of the specifics of the system. It doesn't matter if it is storage (block or file), databases (SQL or NoSQL), integration (messaging or Web services), CMS or search. I've seen this across half a dozen servers my companies have built over the years that have been used by hundreds of thousands of developers in thousands of companies.

The root cause is that some operations should be performed close to where the data lives and need to look at an entire result set as opposed to one result at a time. If these operations cannot be done close to the data (on the ES cluster, in each shard, etc.), then all the data needs to be shipped out over the wire to the client, which can be very expensive. That's the reason behind stored procedures, on-storage computing, the scriptability of NoSQL stores such as Redis, MongoDB & CouchDB, and even the custom queries and calculated fields in ES. Only the most specialized key-value stores, e.g., Cassandra & HBase, don't offer this.

One of the very attractive things about ES is its scripting extensibility. After a quick look at the docs and the code, I found it strange that there is no extensibility point that allows third-party code to operate on the entire query result set. Perhaps a more flexible rescoring model can help with that? Unfortunately, right now rescoring seems to be hard-coded. It's not what the docs seem to imply, namely that the architecture allows it and other rescoring models just aren't written yet. That type of hard-coded dependency feels a bit un-ES-like...

To me, the question is not what other types of rescoring should be implemented, which would be like asking what other types of queries should be implemented in ES. How do we answer this question given that ES is used by so many different people in so many different ways? A better question to ask might be how to make ES follow the patterns of successful, high-performance servers and allow for an extension point that operates on the entire result set. It is called rescore now, but I see it as a more general transformation step, of which rescoring is a common use case, and of which the current rescoring implementation is the one that made the most sense to build first.

If that were available, the ES community would have a way to develop and share rescoring/transformation modules in an easy way. That would benefit everyone and would help ES grow faster. Without this capability, one of two things will happen. Either these data-demanding operations will be performed on the client or developers will be forced to fork the ES codebase to fix the currently hard-coded approach. In the former case, nothing usable could be shared with the community. In the latter case, as with the current hard-coded implementation, nobody will have the incentives to do it well and so there will be no useful pull request contributions. So, the ultimate issue here is as much about technology as it is about open-source community management.

Simon, assuming one wanted to make rescoring scriptable, how should one approach adding this to ES?

Re: How to plug in alternative Rescorer?

Simeon Simeonov
A friend from the ES community pointed out that it's not clear whether what I write about must happen across shards or not.

Sure, a cross-shard solution that can also plug into the aggregation node should be the long-term objective but there is a lot of value in making the current per-shard rescore step pluggable without modifying the aggregation. That's true for two reasons:

1. Many problems may only require custom processing at the shard level

2. Even problems that require custom processing at both the shard and aggregation level would benefit from the processing distribution and data locality of sharding.

The only problems that will not benefit are the ones that must be solved at the aggregation level. This is the minority of problems.

The analogy here is map/reduce processing. The reduce operation should be idempotent. If a hook at the aggregation node is not available, the final reduce step can be performed on the client--of course, net of needing to provision the right data to the client via custom fields or whatever. If a hook is available, it can be performed on ES.

The benefits begin to be unlocked with a custom step on the shards, though. The current abstract API that works on TopDocs is a fine start.
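
A minimal Python sketch of the map/reduce analogy, assuming a "max value per group" reduction. The function names and the shard data are made up for illustration and nothing here is an Elasticsearch API. Each shard reduces its own rows, and the final step (on the aggregation node or on the client) applies the same reduction to the shard outputs.

```python
def reduce_max_per_group(rows):
    """Reduce (group, value) pairs to a {group: max value} mapping."""
    best = {}
    for group, value in rows:
        if group not in best or value > best[group]:
            best[group] = value
    return best


def merge(partials):
    """Final reduce: run the same reduction over the shard-level outputs."""
    return reduce_max_per_group(
        (group, value) for partial in partials for group, value in partial.items()
    )


# Hypothetical per-shard data.
shard_1 = [("a", 10), ("b", 7), ("a", 25)]
shard_2 = [("a", 40), ("c", 3)]

# Step 1: each shard reduces locally, so only small summaries cross the wire.
partials = [reduce_max_per_group(shard) for shard in (shard_1, shard_2)]

# Step 2: the final reduce combines the summaries.
final = merge(partials)
print(final)  # {'a': 40, 'b': 7, 'c': 3}

# Re-applying the reduce to its own output changes nothing, which is the
# property that lets the final step run wherever a hook happens to exist.
assert merge([final]) == final
```

Because the reduction gives the same answer whether it runs once or again on its own output, the shard-level hook is useful on its own, and the last merge can live on the client until an aggregation-node hook exists.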

Re: How to plug in alternative Rescorer?

George Stathis
I was chatting with Simeon about this offline but I might as well add my comment here. I think the idea about idempotence is a good one. Unless there is a way to pass custom data around shards, that's pretty much what needs to happen at first. I found that out the hard way trying to work on SOLR-2072 a while back and being stopped in my tracks by the networking layer. The interfaces just didn't support passing around new fields and custom data. It would be pretty much the same case here. Unless TopDoc and SearchDoc are wrapped, there is no way to get more custom data passed over the wire. The other comment that I made offline to Simeon is that to do what he describes (have access to the entire result set), the pluggable layer IMO probably needs to be in the org.elasticsearch.common.lucene package in the form of custom collectors.

Re: How to plug in alternative Rescorer?

kimchy
Administrator
Hey,

   This thread has expanded quite a bit beyond what was originally asked, so I will simply explain the thought process that we go through in ES itself. For us, the decision is quite simple, to be honest: our goal is to focus less on being able to plug in custom (Java) implementations for specific features, and instead to enable similar capabilities for *all* users through other means (i.e. custom *logic*). A good example is the custom_score query. Sure, one can plug in a custom Lucene Query implementation and implement any custom scoring needed, but we prefer the custom_score route, where we actually empower and enable *all* users to take advantage of it.
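
For readers unfamiliar with the custom_score route, here is a small Python sketch that builds such a request body. The shape (`custom_score` wrapping a `query` plus a `script` and optional `params`) follows the query DSL of that era as best I recall it, so check it against the docs for your version. The field names `title` and `popularity`, the `factor` parameter, and the helper name are all made up for illustration.

```python
import json


def custom_score_query(inner_query, script, params=None):
    """Build a search body that scores hits of inner_query with a script."""
    custom_score = {"query": inner_query, "script": script}
    if params:
        custom_score["params"] = params
    return {"query": {"custom_score": custom_score}}


body = custom_score_query(
    inner_query={"match": {"title": "rescorer"}},    # hypothetical field
    script="_score * doc['popularity'].value * factor",  # hypothetical field
    params={"factor": 1.5},
)
print(json.dumps(body, indent=2))
```

The scoring logic lives in the request itself, so it works on any cluster without installing a plugin, which is the trade-off being described above.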

   Regarding rescore, it's a new feature. The first thing we need to do is flesh out all the additional requirements for it, find a way to enable all users (btw, the query rescorer covers quite a wide range of those), and have those provided as built-in options. Because the feature is so new, I don't see value in working hard to make its implementation pluggable (internal APIs need to be fleshed out, …); I would much prefer to work harder on enabling different usage patterns that can be used by all users.
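
As a reminder of what the built-in query rescorer looks like from the outside, here is a small Python sketch that assembles a rescore request. The keys `window_size`, `rescore_query`, `query_weight`, and `rescore_query_weight` are the ones I know from the rescore documentation. The helper name, the `body` field, and the weights are assumptions chosen for illustration.

```python
import json


def rescore_request(query, rescore_query, window_size=50,
                    query_weight=0.7, rescore_query_weight=1.2):
    """Build a search body that rescores the top window_size hits per shard."""
    return {
        "query": query,
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": rescore_query,
                "query_weight": query_weight,
                "rescore_query_weight": rescore_query_weight,
            },
        },
    }


request = rescore_request(
    # Cheap first pass over a hypothetical "body" field.
    query={"match": {"body": "the quick brown"}},
    # More expensive phrase query applied only to the top window.
    rescore_query={"match_phrase": {"body": {"query": "the quick brown", "slop": 2}}},
)
print(json.dumps(request, indent=2))
```

The rescore query only runs against the top `window_size` hits on each shard, so it can be more expensive than the main query.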

   Regarding generic work on documents across all matches of a query, those cases typically fall under facets, but it really depends on the use case. I do see a place where people will just want to write completely custom logic for both the scatter part and the reduce part, and we need to enable that. Obviously, the *nature* of the custom logic differs, but if it's aggregations, facets are where it fits.

   Last, we do allow custom implementations in many places, typically driven by where we feel comfortable enabling them (a matter of the level of confidence we have in the internal APIs, *not* the external ones). For example, we allow plugging in custom Lucene constructs relatively easily.

Re: How to plug in alternative Rescorer?

Simeon Simeonov
Shay, thanks for sharing your design objectives. Since I'm new to ES, can you point me to the areas in the faceting system where you think a general transformation step could plug in, if/when it makes sense to add one?

Thanks,
Sim

On Monday, April 15, 2013 2:53:12 PM UTC-4, kimchy wrote:
Hey,

   This thread has expanded quite a bit beyond what was originally asked. I will simply explain the thought process that we go through in ES itself. For us, the decision is quite simple, to be honest: our goal is to focus less on being able to plug in custom (Java) implementations for specific features, and instead to enable similar capabilities for *all* users through other means (i.e. custom *logic*). A good example is the custom_score query: sure, one can plug in a custom Lucene Query implementation and implement any custom scoring needed, but we prefer the custom_score route, where we actually empower and enable *all* users to take advantage of it.

   Regarding rescore, it's a new feature. The first thing we need is to start to flesh out all the additional requirements for it, find a way to enable all users (btw, the query rescorer covers quite a wide range of those), and have those provided as built-in options. Because the feature is so new, I don't see value in working hard to make its implementation pluggable (internal APIs need to be fleshed out, …); I much prefer to work harder on enabling different usage patterns that can be used by all users.

   Regarding generic work on documents across all matches of a query, that typically falls under the facets case, but it really depends on the use case. I do see a place where people will just want to write complete custom logic for both the scatter part and the reduce part, and we need to enable that. Obviously, the *nature* of the custom logic differs, but if it's aggregations, facet is where it fits.

   Last, we do allow custom implementations in many places, typically driven by where we feel comfortable enabling them (based on the level of confidence we have in the internal APIs, *not* the external ones). For example, we allow plugging in custom Lucene constructs relatively easily.
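
To make the two built-in routes mentioned above concrete, here is a rough sketch that only builds the request bodies as plain Python dicts: one using a custom_score query with a script, one using the `query` rescorer. The field names (`title`, `popularity`), the weights and the helper function names are made up for illustration, and the exact body layout should be checked against the docs for the ES version in use.

```python
import json


def custom_score_body(text, script):
    """Search body whose query is wrapped in a custom_score query.

    The script is evaluated for every matching document on each shard.
    'title' is a hypothetical field name.
    """
    return {
        "query": {
            "custom_score": {
                "query": {"match": {"title": text}},
                "script": script,
            }
        }
    }


def query_rescore_body(text, phrase, window_size=50):
    """Search body that rescores the top `window_size` hits per shard
    with a second, more expensive query (the `query` rescorer).

    The weights are arbitrary example values.
    """
    return {
        "query": {"match": {"title": text}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "match_phrase": {"title": {"query": phrase, "slop": 2}}
                },
                "query_weight": 0.7,
                "rescore_query_weight": 1.2,
            },
        },
    }


if __name__ == "__main__":
    # 'popularity' is a hypothetical numeric field used by the script.
    script = "_score * doc['popularity'].value"
    print(json.dumps(custom_score_body("quick fox", script), indent=2))
    print(json.dumps(query_rescore_body("quick fox", "quick fox"), indent=2))
```

Both bodies express custom logic purely through the request, which is the point being made: nothing has to be compiled into or plugged into the cluster.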

On Apr 14, 2013, at 12:50 PM, George Stathis <[hidden email]> wrote:

I was chatting with Simeon about this offline but I might as well add my comment here. I think the idea about idempotence is a good one. Unless there is a way to pass custom data around shards, that's pretty much what needs to happen at first. I found that out the hard way trying to work on SOLR-2072 a while back and being stopped in my tracks by the networking layer. The interfaces just didn't support passing around new fields and custom data. It would be pretty much the same case here: unless TopDoc and SearchDoc are wrapped, there is no way to get more custom data passed over the wire. The other comment that I made offline to Simeon is that, to do what he describes (have access to the entire result set), the pluggable layer IMO probably needs to live in the org.elasticsearch.common.lucene package in the form of custom collectors.

On Saturday, April 13, 2013 1:02:21 PM UTC-4, Simeon Simeonov wrote:
A friend from the ES community pointed out that it's not clear whether what I write about must happen across shards or not.

Sure, a cross-shard solution that can also plug into the aggregation node should be the long-term objective but there is a lot of value in making the current per-shard rescore step pluggable without modifying the aggregation. That's true for two reasons:

1. Many problems may only require custom processing at the shard level

2. Even problems that require custom processing at both the shard and aggregation level would benefit from the processing distribution and data locality of sharding.

The only problems that will not benefit are the ones that must be solved at the aggregation level. This is the minority of problems.

The analogy here is map/reduce processing. The reduce operation should be idempotent. If a hook at the aggregation node is not available, the final reduce step can be performed on the client, provided the right data is made available to the client via custom fields or some similar mechanism. If a hook is available, it can be performed on ES.

The benefits begin to be unlocked with a custom step on the shards, though. The current abstract API that works on TopDocs is a fine start.
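
As a toy illustration of that split (not ES code): the sketch below simulates a custom step that runs on each shard's local hits, followed by a reduce step that could run either on an aggregation node or on the client. The function names, the `(doc_id, doc)` hit shape and the rescoring function are all assumptions made up for the example.

```python
import heapq


def shard_step(hits, k, rescore):
    """Runs where the data lives: rescore one shard's hits, keep its top k.

    `hits` is a list of (doc_id, doc) pairs; `rescore` maps a doc to a score.
    Returns (score, doc_id) pairs, best first.
    """
    rescored = [(rescore(doc), doc_id) for doc_id, doc in hits]
    return heapq.nlargest(k, rescored)


def reduce_step(partials, k):
    """Merge per-shard results into a global top k.

    Feeding the output back in as a single partial result returns the same
    list, so it does not matter whether the final merge happens on an
    aggregation node, on the client, or on both.
    """
    merged = [pair for part in partials for pair in part]
    return heapq.nlargest(k, merged)


if __name__ == "__main__":
    shard1 = [("a", {"score": 1.0, "boost": 2.0}), ("b", {"score": 3.0, "boost": 1.0})]
    shard2 = [("c", {"score": 2.5, "boost": 2.0}), ("d", {"score": 0.5, "boost": 1.0})]

    def boosted(doc):
        return doc["score"] * doc["boost"]

    parts = [shard_step(shard, 2, boosted) for shard in (shard1, shard2)]
    print(reduce_step(parts, 2))
```

Only the small per-shard top-k lists cross the wire here; the full documents needed by the rescoring function never leave the shard.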

On Saturday, April 13, 2013 12:02:18 AM UTC-4, Simeon Simeonov wrote:
This is a really cool idea. I can see so many uses for this--some in development and some in production--including TDD/debugging/reporting/analytics, before we even get to manipulating the returned results.

The pattern here is no different--in the abstract sense--than that of a stored procedure. Before we say that search engines and databases are different, let's focus on the fact that they both provide high-performance, runtime customizable data services. The same patterns of data generation & use repeat themselves regardless of the specifics of the system. It doesn't matter if it is storage (block or file), databases (SQL or NoSql), integration (messaging or Web services), CMS or search. I've seen this across half a dozen servers my companies have built over the years that have been used by hundreds of thousands of developers in thousands of companies.

The root cause is that some operations should be performed close to where the data lives and need to look at an entire result set as opposed to one result at a time. If these operations cannot be done close to the data (on the ES cluster, in each shard, etc.), then all the data needs to be shipped over the wire to the client, which can be very expensive. That's the reason behind stored procedures, on-storage computing, the scriptability of NoSql stores such as Redis, MongoDB & CouchDB and even the custom queries and calculated fields in ES. Only the most specialized key-value stores, e.g., Cassandra & HBase, don't offer this.

One of the very attractive things about ES is its scripting extensibility. After a quick look at the docs and the code, I've found it strange that there is no extensibility point that allows third party code to operate on the entire query result set. Perhaps a more flexible rescoring model can help with that? Unfortunately, right now rescoring seems to be hard-coded. It's not like what the docs seem to imply: that the architecture allows it and other rescoring models aren't written yet. That type of hard-coded dependency feels a bit un-ES like...

To me, the question is not what other types of rescoring should be implemented, which would be like asking what other types of queries should be implemented in ES. How do we answer this question given that ES is used by so many different people in so many different ways? A better question to ask might be how to make ES follow the patterns of successful, high-performance servers and allow for an extension point that operates on the entire result set. It is called rescore now, but I see it as a more general transformation step, of which rescoring is a common use case and of which the current rescoring implementation is the one that made the most sense to build first.

If that were available, the ES community would have a way to develop and share rescoring/transformation modules in an easy way. That would benefit everyone and would help ES grow faster. Without this capability, one of two things will happen. Either these data-demanding operations will be performed on the client or developers will be forced to fork the ES codebase to fix the currently hard-coded approach. In the former case, nothing usable could be shared with the community. In the latter case, as with the current hard-coded implementation, nobody will have the incentives to do it well and so there will be no useful pull request contributions. So, the ultimate issue here is as much about technology as it is about open-source community management.

Simon, assuming one wanted to make rescoring scriptable, how should one approach adding this to ES?


Re: How to plug in alternative Rescorer?

Otis Gospodnetic
In reply to this post by kimchy
Hi,

If I understand this correctly, the hope is that higher-level functionality can be exposed in a simpler, less "intrusive" way and, rather than satisfying one narrow, specific use case, serve as the enabler for N different user-level features?  That seems fine by me, as long as this higher-level functionality can indeed satisfy things like what Simeon brought up or what I initially asked about.

> Regarding generic work on documents across all matches of a query, those typically fall under the facets case, but it really depends on the use case. I do see a place where people will just want to write complete custom logic for both the scatter part and the reduce part, we need to enable that. Obviously, the *nature* of the custom logic differs, but if its aggregations, facet is where it fits.

Shay & Co, are you referring to something like https://github.com/imotov/elasticsearch-facet-script ?
If not, is there something else in ES itself or elsewhere that we could model things after that you could point us to?

Thanks,
Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html





Re: How to plug in alternative Rescorer?

Sebastian Gavarini
Hi,

Quoting Shay: "Obviously, the *nature* of the custom logic differs, but if its aggregations, facet is where it fits."
I don't think facets are a good fit in this case, as what's needed is not only aggregating, i.e. "get the max value of a certain field with a group-by restriction", but also manipulating the TopDocs, i.e. "removing the not-max-valued documents". Some information must also be added to the winning documents (like ScriptField does).

This specific use case must do something like:
1) run an ES query
2) in each shard, get the TopDocs and filter out some documents according to a "group by" function and a "unique" restriction. Also add some extra fields to the winning documents, like ScriptField does.
3) in the calling node, also execute the filtering from step 2)

The general use case would be a map/reduce hook that could manipulate the data locally in the shards, and later in the reduce step. Some calculations done in that reduce step would also need to be added to the returned documents, like ScriptField does.
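
To make the three steps concrete, here is a small simulation in plain Python (no ES involved). It keeps, per group, only the document with the max value, runs the same filter on each "shard" and again on the merged results, and attaches an extra computed field to each winner. Every name in it (`keep_group_winners`, `group_size`, the `brand` and `price` fields) is hypothetical and only there for illustration.

```python
def keep_group_winners(hits, group_field, value_field):
    """Keep, for each group, only the hit with the highest value.

    Each winner also carries 'group_size': how many hits it has beaten or
    absorbed so far (the hit itself included). Because the output has the
    same shape as the input, the same function serves as both the per-shard
    step and the reduce step on the calling node.
    """
    winners = {}
    for hit in hits:
        key = hit[group_field]
        seen = hit.get("group_size", 1)  # raw hits count as one document
        best = winners.get(key)
        if best is None:
            winners[key] = dict(hit, group_size=seen)
        else:
            top = hit if hit[value_field] > best[value_field] else best
            winners[key] = dict(top, group_size=best["group_size"] + seen)
    return list(winners.values())


def grouped_search(shards, group_field, value_field):
    """Steps 2) and 3): filter on every shard, then filter the merged winners."""
    per_shard = [keep_group_winners(s, group_field, value_field) for s in shards]
    merged = [hit for part in per_shard for hit in part]
    final = keep_group_winners(merged, group_field, value_field)
    return sorted(final, key=lambda h: h[group_field])


if __name__ == "__main__":
    shards = [
        [{"id": 1, "brand": "x", "price": 10},
         {"id": 2, "brand": "x", "price": 30},
         {"id": 3, "brand": "y", "price": 5}],
        [{"id": 4, "brand": "x", "price": 20},
         {"id": 5, "brand": "y", "price": 7}],
    ]
    for hit in grouped_search(shards, "brand", "price"):
        print(hit)
```

The extra `group_size` field plays the role of the ScriptField-like addition: it is computed partly on the shards and completed in the reduce step, which is why a facet alone would not cover this case.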

Do you think it might be a good addition to the current code base?
Would you be interested in coding it, or in accepting a pull request for it? For a pull request, could you provide a bit of guidance on implementing this in a clean way?

Thanks,
Sebastian.



On Mon, Apr 15, 2013 at 4:58 PM, Otis Gospodnetic <[hidden email]> wrote:
Hi,

If I undertand this correctly, the hope is that higher-level functionality can be exposed in simpler, less "intrusive" way and be used to satisfy not a narrower, more specific use case, but be the enabler for N different user-level features?  That seems fine by me, as long as this higher-level functionality can indeed satisfy things like what Simeon brought up or what I initially asked about.

> Regarding generic work on documents across all matches of a query, those typically fall under the facets case, but it really depends on the use case. I do see a place where people will just want to write complete custom logic for both the scatter part and the reduce part, we need to enable that. Obviously, the *nature* of the custom logic differs, but if its aggregations, facet is where it fits.

Shay & Co, are you referring to something like https://github.com/imotov/elasticsearch-facet-script ?
If not, is there something else in ES itself or elsewhere that we could model things after that you could point us to?

Thanks,
Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html




On Monday, April 15, 2013 2:53:12 PM UTC-4, kimchy wrote:
Hey,

   This thread has expanded quite a bit beyond what was originally asked. I will simply explain the thought process that we go through in ES itself. For us, the decision is quite simple to be honest, our goal is to focus less about being able to plug custom (Java) implementations for specific features, but instead enable similar capabilities to *all* users through other means (i.e. custom *logic*). A good example is custom_score query, sure, one can plug in a custom Lucene Query implementation, and implement any custom scoring needed, but we prefer the custom_score route, where we actually empower and enable *all* users to take advantage of it.

   Regarding rescore, its a new feature. The first thing we need is to start to flush out all the additional requirements out of it, and find a way to enable all users (btw, the query rescorer covers quite a wide range of those), and have those provided as built in options. Because the feature is so new, I don't see value in trying to work hard in making its implementation pluggable (internal APIs need to be flushed out, …) , much prefer to work harder in enabling different usage patterns that can be used by all users.

   Regarding generic work on documents across all matches of a query, those typically fall under the facets case, but it really depends on the use case. I do see a place where people will just want to write complete custom logic for both the scatter part and the reduce part, we need to enable that. Obviously, the *nature* of the custom logic differs, but if its aggregations, facet is where it fits.

   Last, we do allow for custom implementations in many places, typically driven in where we feel comfortable at enabling it (a combination of the level of confidence we have with the internal APIs, *not* the external ones). For example, we allow to plug custom Lucene constructs relatively easily.

On Apr 14, 2013, at 12:50 PM, George Stathis <[hidden email]> wrote:

I was chatting with Simeon about this offline but I might as well add my comment here. I think the idea about idempotence is a good one. Unless there is a way to pass custom data around shards, that's pretty much what needs to happen at first. I found that out the hard way trying to work on SORL-2072 a while back and being stopped in my tracks by the networking layer. The interfaces just didn't support passing around new fields and custom data. It would be pretty much the same case here. Unless TopDoc and SearchDoc are wrapped, there is not way to get more custom data passed around the wire. The other comment that I made offline to Simeon is that to do what he describes (have access to the entire result set) the pluggable layer IMO probably needs to be in the org.elasticsearch.common.lucene package in the form of custom collectors. 

On Saturday, April 13, 2013 1:02:21 PM UTC-4, Simeon Simeonov wrote:
A friend from the ES community pointed out that it's not clear whether what I write about must happen across shards or not.

Sure, a cross-shard solution that can also plug into the aggregation node should be the long-term objective but there is a lot of value in making the current per-shard rescore step pluggable without modifying the aggregation. That's true for two reasons:

1. Many problems may only require custom processing at the shard level

2. Even problems that require custom processing at both the shard and aggregation level would benefit from the processing distribution and data locality of sharding.

The only problems that will not benefit are the ones that must be solved at the aggregation level. This is the minority of problems.

The analogy here is map/reduce processing. The reduce operation should be idempotent. If a hook at the aggregation node is not available, the final reduce step can be performed on the client--of course, net of needing to provision the right data to the client via custom fields or whatever. If a hook is available, it can be performed on ES.

The benefits begin to be unlocked with a custom step on the shards, though. The current abstract API that works on TopDocs is a fine start.

On Saturday, April 13, 2013 12:02:18 AM UTC-4, Simeon Simeonov wrote:
This is a really cool idea. I can see so many uses for this--some in development and some in production--including TDD/debugging/reporting/analytics, before we even get to manipulating the returned results.

The pattern here is no different--in the abstract sense--than that of a stored procedure. Before we say that search engines and databases are different, let's focus on the fact that they both provide high-performance, runtime customizable data services. The same patterns of data generation & use repeat themselves regardless of the specifics of the system. It doesn't matter if it is storage (block or file), databases (SQL or NoSql), integration (messaging or Web services), CMS or search. I've seen this across half a dozen servers my companies have built over the years that have been used by hundreds of thousands of developers in thousands of companies.

The root cause is that there are some operations that should be performed close to where the data is at and that need to look at an entire result set as opposed to one result at a time. If these operation cannot be done close to the data (on the ES cluster, in each shard, etc.), then all the data needs to be shipped out on the wire to the client, which can be very expensive. That's the reason behind stored procedures, on-storage computing, the scriptability of NoSql stores such as Redis, MongoDB & CouchDB and even the custom queries and calculated fields in ES. Only the most specialized key-value stores, e.g., Cassandra & HBase, don't offer this.

One of the very attractive things about ES is its scripting extensibility. After a quick look at the docs and the code, I've found it strange that there is no extensibility point that allows third party code to operate on the entire query result set. Perhaps a more flexible rescoring model can help with that? Unfortunately, right now rescoring seems to be hard-coded. It's not like what the docs seem to imply: that the architecture allows it and other rescoring models aren't written yet. That type of hard-coded dependency feels a bit un-ES like...

To me, the question is not what other types of rescoring should be implemented, which would be like asking what other types of queries should be implemented in ES. How do we answer this question given that ES is used by so many different people in so many different ways? A better question to ask might be how to make ES follow the patterns of successful, high-performance servers and allow for an extension point that operates on the entire result set. It is called rescore now but I see it as a more general transformation step, of which rescoring is a common use case and, of which the current rescoring implementation is the one that made the best sense to build first.

If that were available, the ES community would have a way to develop and share rescoring/transformation modules in an easy way. That would benefit everyone and would help ES grow faster. Without this capability, one of two things will happen. Either these data-demanding operations will be performed on the client or developers will be forced to fork the ES codebase to fix the currently hard-coded approach. In the former case, nothing usable could be shared with the community. In the latter case, as with the current hard-coded implementation, nobody will have the incentives to do it well and so there will be no useful pull request contributions. So, the ultimate issue here is as much about technology as it is about open-source community management.

Simon, assuming one wanted to make rescoring scriptable, how should one approach adding this to ES?

On Friday, April 12, 2013 3:01:16 AM UTC-4, simonw wrote:


On Friday, April 12, 2013 12:04:18 AM UTC+2, Otis Gospodnetic wrote:
Hi,

How does one plug in a custom Rescorer into ElasticSearch?
This is from Simon's writeup on query rescorer:

"
Currently the rescore API has only one implementation (the `query` rescorer) which modifies the result set in-place. Future developments could include dedicated rescore results if needed by the implemenation ie. a pair-wise reranker.
"

Sounds like alternative implementations should be pluggable, and it does look like there are a number of abstract classes and interfaces to allow alternative implementations.  I am just not sure if there is a standard way to tell ES about my alternative rescorer... is there?

Not yet. Do you have any alternative in mind? Can you share your thoughts on this?

simon 

Thanks,
Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html





Re: How to plug in alternative Rescorer?

Otis Gospodnetic
Hi,

I'm interested in what Sebastian is describing here. If he's right that piggy-backing on faceting is not suitable for the described use case, could anyone please suggest an alternate route? Ideally one that might stand a chance of getting pulled into ES?

Thanks,
Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html


On Monday, April 15, 2013 6:45:38 PM UTC-4, Sebastian wrote:
Hi,

Quoting Shay: "Obviously, the *nature* of the custom logic differs, but if it's aggregations, facet is where it fits."
I think facets are not a good fit in this case, as what's needed is not only aggregating, i.e. "get the max value of a certain field with a group by restriction", but also manipulating the TopDocs, i.e. "removing the not-max-valued documents". Some information must also be added to the winning documents (like ScriptField does).

This specific use case must do something like:
1) run an ES query
2) in each shard, get the TopDocs and filter out some documents according to a "group by" function and a "unique" restriction. Also add some extra fields to the winning documents, like ScriptField does.
3) in the calling node, also execute the filtering from step 2)
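
The three steps above can be sketched in plain Java. This is only an illustration under stated assumptions, not ES or Lucene code: `Hit`, `keepMaxPerGroup` and the `group_max` extra field are hypothetical names, and the per-shard TopDocs are modelled as simple lists. The same filter runs once per shard and once more on the calling node over the shard winners.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupMaxFilter {

    /** Hypothetical stand-in for a search hit; not an ES or Lucene class. */
    public static final class Hit {
        public final String id;
        public final String group;   // value of the "group by" field
        public final double value;   // field whose maximum picks the winner
        public final Map<String, Object> extraFields = new LinkedHashMap<>();

        public Hit(String id, String group, double value) {
            this.id = id;
            this.group = group;
            this.value = value;
        }
    }

    /**
     * Keeps one hit per group (the "unique" restriction): the hit with the
     * largest value. On ties the first hit seen wins. Each winner gets an
     * extra field, loosely mimicking what ScriptField does.
     */
    public static List<Hit> keepMaxPerGroup(List<Hit> hits) {
        Map<String, Hit> winners = new LinkedHashMap<>();
        for (Hit hit : hits) {
            Hit current = winners.get(hit.group);
            if (current == null || hit.value > current.value) {
                winners.put(hit.group, hit);
            }
        }
        List<Hit> result = new ArrayList<>(winners.values());
        for (Hit winner : result) {
            winner.extraFields.put("group_max", winner.value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Step 1 is assumed to have produced these per-shard hits.
        List<Hit> shard0 = Arrays.asList(
                new Hit("a", "red", 3), new Hit("b", "red", 7), new Hit("c", "blue", 1));
        List<Hit> shard1 = Arrays.asList(
                new Hit("d", "red", 5), new Hit("e", "blue", 4));

        // Step 2: filter locally in each shard.
        List<Hit> merged = new ArrayList<>();
        merged.addAll(keepMaxPerGroup(shard0));
        merged.addAll(keepMaxPerGroup(shard1));

        // Step 3: the calling node runs the same filter over the shard winners.
        for (Hit hit : keepMaxPerGroup(merged)) {
            System.out.println(hit.id + " " + hit.group + " " + hit.extraFields);
        }
    }
}
```

Because a group's global maximum is always the local maximum of some shard, filtering per shard first discards no document that could win at the calling node; it only reduces what has to travel over the wire.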

The general use case would be a map/reduce hook that could manipulate the data locally in the shards, and again later at the reduce step. Some calculations from that reduce step would also need to be added to the returned documents, like ScriptField does.

Do you think it might be a good addition to the current code base?
Would you be interested in either coding it or accepting a pull request for this? For a pull request, could you provide a bit of guidance on implementing this in a clean way?

Thanks,
Sebastian.



On Mon, Apr 15, 2013 at 4:58 PM, Otis Gospodnetic <otis.gos...@...> wrote:
Hi,

If I understand this correctly, the hope is that higher-level functionality can be exposed in a simpler, less "intrusive" way, and that it would not just satisfy a narrower, more specific use case but be the enabler for N different user-level features? That seems fine by me, as long as this higher-level functionality can indeed satisfy things like what Simeon brought up or what I initially asked about.

> Regarding generic work on documents across all matches of a query, those typically fall under the facets case, but it really depends on the use case. I do see a place where people will just want to write completely custom logic for both the scatter part and the reduce part, and we need to enable that. Obviously, the *nature* of the custom logic differs, but if it's aggregations, facet is where it fits.

Shay & Co, are you referring to something like https://github.com/imotov/elasticsearch-facet-script ?
If not, is there something else in ES itself or elsewhere that we could model things after that you could point us to?

Thanks,
Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html




On Monday, April 15, 2013 2:53:12 PM UTC-4, kimchy wrote:
Hey,

   This thread has expanded quite a bit beyond what was originally asked. I will simply explain the thought process that we go through in ES itself. For us, the decision is quite simple, to be honest: our goal is to focus less on being able to plug in custom (Java) implementations for specific features, and instead to enable similar capabilities for *all* users through other means (i.e. custom *logic*). A good example is the custom_score query. Sure, one can plug in a custom Lucene Query implementation and implement any custom scoring needed, but we prefer the custom_score route, where we actually empower and enable *all* users to take advantage of it.

   Regarding rescore, it's a new feature. The first thing we need is to start to flesh out all the additional requirements for it, find a way to enable all users (btw, the query rescorer covers quite a wide range of those), and have those provided as built-in options. Because the feature is so new, I don't see value in working hard to make its implementation pluggable (internal APIs need to be fleshed out, …); I much prefer to work harder on enabling different usage patterns that can be used by all users.

   Regarding generic work on documents across all matches of a query, those typically fall under the facets case, but it really depends on the use case. I do see a place where people will just want to write completely custom logic for both the scatter part and the reduce part, and we need to enable that. Obviously, the *nature* of the custom logic differs, but if it's aggregations, facet is where it fits.

   Last, we do allow for custom implementations in many places, typically driven by where we feel comfortable enabling it (based on the level of confidence we have in the internal APIs, *not* the external ones). For example, we allow plugging in custom Lucene constructs relatively easily.

On Apr 14, 2013, at 12:50 PM, George Stathis <[hidden email]> wrote:

I was chatting with Simeon about this offline but I might as well add my comment here. I think the idea about idempotence is a good one. Unless there is a way to pass custom data around shards, that's pretty much what needs to happen at first. I found that out the hard way trying to work on SOLR-2072 a while back and being stopped in my tracks by the networking layer. The interfaces just didn't support passing around new fields and custom data. It would be pretty much the same case here. Unless TopDoc and SearchDoc are wrapped, there is no way to get more custom data passed over the wire. The other comment that I made offline to Simeon is that to do what he describes (have access to the entire result set), the pluggable layer IMO probably needs to be in the org.elasticsearch.common.lucene package in the form of custom collectors.

On Saturday, April 13, 2013 1:02:21 PM UTC-4, Simeon Simeonov wrote:
A friend from the ES community pointed out that it's not clear whether what I write about must happen across shards or not.

Sure, a cross-shard solution that can also plug into the aggregation node should be the long-term objective but there is a lot of value in making the current per-shard rescore step pluggable without modifying the aggregation. That's true for two reasons:

1. Many problems may only require custom processing at the shard level

2. Even problems that require custom processing at both the shard and aggregation level would benefit from the processing distribution and data locality of sharding.

The only problems that will not benefit are the ones that must be solved at the aggregation level. This is the minority of problems.

The analogy here is map/reduce processing. The reduce operation should be idempotent. If a hook at the aggregation node is not available, the final reduce step can be performed on the client--of course, net of needing to provision the right data to the client via custom fields or whatever. If a hook is available, it can be performed on ES.

The benefits begin to be unlocked with a custom step on the shards, though. The current abstract API that works on TopDocs is a fine start.
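
A minimal sketch of the map/reduce analogy above, assuming a custom per-shard step that keeps each shard's local top k by score. `ScoredDoc`, `topK` and `ids` are hypothetical names for illustration, not ES or Lucene APIs. The final reduce is the same function applied to the concatenated shard results, and applying it again to its own output changes nothing, so it does not matter whether that last step runs on an aggregation node or on the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopKReduce {

    /** Hypothetical stand-in for a scored document; not an ES or Lucene class. */
    public static final class ScoredDoc {
        public final String id;
        public final double score;

        public ScoredDoc(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Returns the k highest-scoring docs, ties broken by id for determinism. */
    public static List<ScoredDoc> topK(List<ScoredDoc> docs, int k) {
        List<ScoredDoc> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> {
            int byScore = Double.compare(b.score, a.score);
            return byScore != 0 ? byScore : a.id.compareTo(b.id);
        });
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    /** Helper that extracts ids so results are easy to compare and print. */
    public static List<String> ids(List<ScoredDoc> docs) {
        List<String> out = new ArrayList<>();
        for (ScoredDoc doc : docs) {
            out.add(doc.id);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ScoredDoc> shardA = Arrays.asList(
                new ScoredDoc("a", 0.9), new ScoredDoc("b", 0.4), new ScoredDoc("c", 0.7));
        List<ScoredDoc> shardB = Arrays.asList(
                new ScoredDoc("d", 0.8), new ScoredDoc("e", 0.1));
        int k = 2;

        // Custom step on each shard: only k docs per shard leave the shard.
        List<ScoredDoc> partials = new ArrayList<>();
        partials.addAll(topK(shardA, k));
        partials.addAll(topK(shardB, k));

        // Final reduce, on the aggregation node if a hook exists, else on the client.
        List<ScoredDoc> reduced = topK(partials, k);

        System.out.println("reduced:       " + ids(reduced));
        System.out.println("reduced again: " + ids(topK(reduced, k)));
    }
}
```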

On Saturday, April 13, 2013 12:02:18 AM UTC-4, Simeon Simeonov wrote:
This is a really cool idea. I can see so many uses for this--some in development and some in production--including TDD/debugging/reporting/analytics, before we even get to manipulating the returned results.

The pattern here is no different--in the abstract sense--than that of a stored procedure. Before we say that search engines and databases are different, let's focus on the fact that they both provide high-performance, runtime customizable data services. The same patterns of data generation & use repeat themselves regardless of the specifics of the system. It doesn't matter if it is storage (block or file), databases (SQL or NoSql), integration (messaging or Web services), CMS or search. I've seen this across half a dozen servers my companies have built over the years that have been used by hundreds of thousands of developers in thousands of companies.

The root cause is that some operations should be performed close to where the data is and need to look at an entire result set as opposed to one result at a time. If these operations cannot be done close to the data (on the ES cluster, in each shard, etc.), then all the data needs to be shipped out on the wire to the client, which can be very expensive. That's the reason behind stored procedures, on-storage computing, the scriptability of NoSql stores such as Redis, MongoDB & CouchDB and even the custom queries and calculated fields in ES. Only the most specialized key-value stores, e.g., Cassandra & HBase, don't offer this.

One of the very attractive things about ES is its scripting extensibility. After a quick look at the docs and the code, I've found it strange that there is no extensibility point that allows third party code to operate on the entire query result set. Perhaps a more flexible rescoring model can help with that? Unfortunately, right now rescoring seems to be hard-coded. It's not like what the docs seem to imply: that the architecture allows it and other rescoring models aren't written yet. That type of hard-coded dependency feels a bit un-ES like...

To me, the question is not what other types of rescoring should be implemented, which would be like asking what other types of queries should be implemented in ES. How do we answer this question given that ES is used by so many different people in so many different ways? A better question to ask might be how to make ES follow the patterns of successful, high-performance servers and allow for an extension point that operates on the entire result set. It is called rescore now, but I see it as a more general transformation step, of which rescoring is a common use case and of which the current rescoring implementation is the one that made the most sense to build first.

If that were available, the ES community would have an easy way to develop and share rescoring/transformation modules. That would benefit everyone and would help ES grow faster. Without this capability, one of two things will happen. Either these data-demanding operations will be performed on the client or developers will be forced to fork the ES codebase to fix the currently hard-coded approach. In the former case, nothing usable could be shared with the community. In the latter case, as with the current hard-coded implementation, nobody will have the incentive to do it well, so there will be no useful pull request contributions. So the ultimate issue here is as much about technology as it is about open-source community management.

Simon, assuming one wanted to make rescoring scriptable, how should one approach adding this to ES?

On Friday, April 12, 2013 3:01:16 AM UTC-4, simonw wrote:


On Friday, April 12, 2013 12:04:18 AM UTC+2, Otis Gospodnetic wrote:
Hi,

How does one plug in a custom Rescorer into ElasticSearch?
This is from Simon's writeup on query rescorer:

"
Currently the rescore API has only one implementation (the `query` rescorer) which modifies the result set in-place. Future developments could include dedicated rescore results if needed by the implementation, e.g. a pair-wise reranker.
"

Sounds like alternative implementations should be pluggable, and it does look like there are a number of abstract classes and interfaces to allow alternative implementations.  I am just not sure if there is a standard way to tell ES about my alternative rescorer... is there?

Not yet. Do you have any alternative in mind? Can you share your thoughts on this?

simon 

Thanks,
Otis
--
ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html



