Decorating _search with additional data


Decorating _search with additional data

Dawid Weiss


Hi everyone,

I have a question related to ES internals (providing an extension to search functionality).

We have a customer who would like to integrate clustering of search results directly with ES so that it happens as part of the search. This functionality is essentially identical to a plain search, with some additional parameters to select the clustering algorithm, etc. I know medcl already implemented a Carrot2 plugin for ES, and I looked at his code, but we need something more generic that also lets proprietary clustering algorithms be used with the plugin seamlessly. But back to the point.

I've been looking at the architecture and ways this could be accomplished (by the way -- kudos to everyone involved, the code looks and works very cool... bonsai cool) and have a few questions that popped up.

1) It seems that the "nicest" way to accomplish the task in question would be to somehow plug into the search action, ideally as a SearchPhase (or rather a FetchSubPhase). These sub-phases currently seem to be fixed and not extensible, and I can already see problems with serialization if the search result is somehow augmented at this level. Do you think it's at all possible (and a good idea) to try to plug it in there?

2) Since (1) seemed very intrusive, I temporarily implemented a custom plugin (and an action/request/response triple). My code essentially does nothing but delegate most of its internal workings to Search*: currently all the "logic" that actually does the clustering resides in a subclass of TransportAction; in doExecute it delegates to TransportSearchAction, then inside onResponse it clusters the result and returns the augmented response to the user.

This works fine, but clustering is pretty heavy on computational resources, and I wondered whether TransportAction is a good place for this logic and what threading (threadpool) magic should be used to make it fit with the rest of ES.

Another problem is that the REST handler could be implemented in pretty much the same way, but the search-request parsing logic in RestSearchAction#parseSearchRequest is currently private and there is no way to reuse it (and I'd say it begs for reuse, since it's far from trivial and a copy-paste will most likely go out of sync in future versions).
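
For readers following along, here is a minimal sketch of the delegate-then-augment shape described in (2). It deliberately does not use the real ES classes: Listener, SearchBackend, ClusteredResponse and ClusteringAction are hypothetical stand-ins for the listener, the search action being delegated to, and the custom action, and the "clustering" is a toy grouping by first word, not a real algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an asynchronous response callback. */
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

/** Hypothetical stand-in for the search action that the plugin delegates to. */
interface SearchBackend {
    void search(String query, Listener<List<String>> listener);
}

/** The plain hits plus the extra data computed from them. */
final class ClusteredResponse {
    final List<String> hits;
    final Map<String, List<String>> clusters;

    ClusteredResponse(List<String> hits, Map<String, List<String>> clusters) {
        this.hits = hits;
        this.clusters = clusters;
    }
}

/** Delegates the search, then decorates the result inside onResponse. */
final class ClusteringAction {
    private final SearchBackend delegate;

    ClusteringAction(SearchBackend delegate) {
        this.delegate = delegate;
    }

    void execute(String query, final Listener<ClusteredResponse> listener) {
        delegate.search(query, new Listener<List<String>>() {
            @Override public void onResponse(List<String> hits) {
                // Augment the plain search result before handing it back.
                listener.onResponse(new ClusteredResponse(hits, cluster(hits)));
            }
            @Override public void onFailure(Exception e) {
                listener.onFailure(e); // failures pass through unchanged
            }
        });
    }

    /** Toy placeholder: groups hits by their first word. */
    static Map<String, List<String>> cluster(List<String> hits) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String hit : hits) {
            String label = hit.split("\\s+")[0].toLowerCase();
            clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(hit);
        }
        return clusters;
    }
}

final class ClusteringDemo {
    public static void main(String[] args) {
        // A fake backend that answers synchronously with three hits.
        SearchBackend fake = (query, l) ->
            l.onResponse(Arrays.asList("apple pie", "apple tart", "banana split"));
        new ClusteringAction(fake).execute("dessert", new Listener<ClusteredResponse>() {
            @Override public void onResponse(ClusteredResponse r) {
                System.out.println(r.clusters);
            }
            @Override public void onFailure(Exception e) {
                e.printStackTrace();
            }
        });
    }
}
```

Note that in this shape the clustering runs on whatever thread invokes the delegate's callback, which is exactly the threading concern raised above.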

Thanks for all the tips and hints,
Dawid

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Decorating _search with additional data

Martijn v Groningen

> 1) It seems that the "nicest" way to accomplish the task in question would be to somehow plug into the search action, ideally as a SearchPhase (or rather a FetchSubPhase). These sub-phases currently seem to be fixed and not extensible, and I can already see problems with serialization if the search result is somehow augmented at this level. Do you think it's at all possible (and a good idea) to try to plug it in there?

Other than forking the code base, I don't see a way to add your own SearchPhase or FetchSubPhase. There are a few extension points in the codebase: custom queries/filters, tokenizers and token filters, discovery, facets, and, more recently, suggesters and highlighters.
 

> 2) Since (1) seemed very intrusive, I temporarily implemented a custom plugin (and an action/request/response triple). My code essentially does nothing but delegate most of its internal workings to Search*: currently all the "logic" that actually does the clustering resides in a subclass of TransportAction; in doExecute it delegates to TransportSearchAction, then inside onResponse it clusters the result and returns the augmented response to the user.

I think that in the current codebase this is the only way to create the type of plugin you need.
 

> This works fine, but clustering is pretty heavy on computational resources, and I wondered whether TransportAction is a good place for this logic and what threading (threadpool) magic should be used to make it fit with the rest of ES.

Make sure the clustering never runs on a network thread; offload it to a thread from the thread pool instead. There are several thread pools, and I think in your case you should use the search thread pool.
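
A rough sketch of that offloading advice, using only java.util.concurrent. The ExecutorService here is an assumed stand-in for a node-managed pool such as the search pool; how the real pool is obtained inside ES is not shown, and OffloadingCallback and expensiveClustering are hypothetical names. Error handling is left to the returned future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Runs expensive post-processing on a worker pool rather than on the calling thread. */
final class OffloadingCallback {
    // Assumption: stand-in for a node-managed pool (the reply suggests the search pool).
    private final ExecutorService workers;

    OffloadingCallback(ExecutorService workers) {
        this.workers = workers;
    }

    /** Called on the (simulated) network thread; returns at once with a future. */
    CompletableFuture<String> onResponse(String hits) {
        return CompletableFuture.supplyAsync(() -> expensiveClustering(hits), workers);
    }

    /** Placeholder for the real clustering; reports which thread ran it. */
    private static String expensiveClustering(String hits) {
        return "clustered(" + hits + ") on " + Thread.currentThread().getName();
    }
}

final class OffloadDemo {
    public static void main(String[] args) {
        ExecutorService pool =
            Executors.newFixedThreadPool(2, r -> new Thread(r, "search-worker"));
        try {
            OffloadingCallback callback = new OffloadingCallback(pool);
            System.out.println("caller: " + Thread.currentThread().getName());
            // join() blocks only in this demo; a real callback would chain on the future.
            System.out.println(callback.onResponse("3 hits").join());
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of returning a future is that the caller's thread is released immediately; only the demo's main method waits for the result.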
 

> Another problem is that the REST handler could be implemented in pretty much the same way, but the search-request parsing logic in RestSearchAction#parseSearchRequest is currently private and there is no way to reuse it (and I'd say it begs for reuse, since it's far from trivial and a copy-paste will most likely go out of sync in future versions).

I think the parseSearchRequest method can be made protected.

 


Re: Decorating _search with additional data

sujoysett
In reply to this post by Dawid Weiss
Hi Dawid,

Is there any specific reason you are not interested in a Facet plugin?

In an application I worked on, we needed responses similar to those of Carrot2 Lingo; the only difference was that our data was not search results but complete social media content: blogs, news, comments, etc. The number of such documents was also huge, more than a million, and even a single search would fetch no fewer than a few thousand docs.

We developed a custom elasticsearch facet plugin for the same, compatible with ES 0.19.*. We derived a custom distance logic to detect document similarities, and also derived our own full text clustering algorithm in a map-reduce architecture. It is fully functional and pretty fast.

In fact we have plans to rewrite the same plugin with some improvements compatible with ES 0.90.*, but that is still pending since we are occupied with other tasks.

-- Sujoy.




Re: Decorating _search with additional data

Dawid Weiss
In reply to this post by Martijn v Groningen
Thanks for the hints, Martijn.

> I think the parseSearchRequest method can be made protected.

For me it'd have to be public -- I don't want to subclass, I want to
keep my stuff separate and just delegate parsing of a fragment of my
request that I know is a search request. It'd be nice if it could be
made reusable I guess.
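
To illustrate the difference between "protected" and "public" here, a small sketch of reuse by delegation rather than subclassing. Everything in it is hypothetical: SimpleSearchRequest, SearchRequestParser and ClusteringRequest are made-up classes, the parameter names ("index", "q", "size", "algorithm") are chosen for illustration only, and the real parseSearchRequest is far more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, heavily simplified search request. */
final class SimpleSearchRequest {
    final String[] indices;
    final String query;
    final int size;

    SimpleSearchRequest(String[] indices, String query, int size) {
        this.indices = indices;
        this.query = query;
        this.size = size;
    }
}

/** Parsing exposed as a public static helper, so callers need not subclass anything. */
final class SearchRequestParser {
    private SearchRequestParser() {}

    public static SimpleSearchRequest parse(Map<String, String> params) {
        String index = params.getOrDefault("index", "_all");
        String query = params.getOrDefault("q", "*");
        int size = Integer.parseInt(params.getOrDefault("size", "10"));
        return new SimpleSearchRequest(index.split(","), query, size);
    }
}

/** A separate request type that owns only its own parameters and delegates the rest. */
final class ClusteringRequest {
    final SimpleSearchRequest search;
    final String algorithm;

    private ClusteringRequest(SimpleSearchRequest search, String algorithm) {
        this.search = search;
        this.algorithm = algorithm;
    }

    static ClusteringRequest parse(Map<String, String> params) {
        // Only the clustering-specific part is parsed here ...
        String algorithm = params.getOrDefault("algorithm", "lingo");
        // ... the search fragment is handed to the shared parser.
        return new ClusteringRequest(SearchRequestParser.parse(params), algorithm);
    }
}

final class ParserDemo {
    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("index", "blogs,news");
        params.put("q", "elasticsearch");
        ClusteringRequest request = ClusteringRequest.parse(params);
        System.out.println(request.algorithm + " over " + request.search.indices.length
            + " indices, size " + request.search.size);
    }
}
```

With a protected method the clustering handler would have to extend the search handler to reach the parser; a public helper keeps the two classes unrelated.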

Dawid


Re: Decorating _search with additional data

Dawid Weiss
In reply to this post by sujoysett
> Is there any specific reason you are not interested in a Facet plugin?

For faceting you'd need to extract something and keep it in the index
(named entities, whatever). Clustering is dynamic and you don't need
to do that. There are pros and cons to both (and they are
complementary most of the time).

Besides, I'm one of the authors of Carrot2, so I have an obvious reason
to stick to my stuff :)

> We derived a custom distance logic to detect document similarities, and also derived our own full text clustering algorithm in a map-reduce architecture. It is fully functional and pretty fast.

This would be an interesting piece of code in general (even for
Mahout). Let me know if you publish it somewhere, I'd be interested.

Dawid


Re: Decorating _search with additional data

Christoph Evers
I would be interested too.

@Dawid: I just discovered Carrot2, great piece of site/software!


Re: Decorating _search with additional data

Otis Gospodnetic
In reply to this post by Dawid Weiss
+1 to making this extensible.

If I understand what Dawid did, I would think he would ideally not have to write a custom REST action, because that means the client needs to be changed and told to use this new REST action, which is not nice. Ideally, custom work done before and/or after the search is executed would be transparent to the client.
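
One way to picture that kind of transparency is a single search entry point with registered before/after hooks, sketched below. This is purely illustrative and not an existing ES mechanism: ExtensibleSearch, addBefore and addAfter are hypothetical names, and queries and hits are reduced to strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** One search entry point whose behaviour can be extended without the caller noticing. */
final class ExtensibleSearch {
    private final Function<String, List<String>> coreSearch;
    private final List<UnaryOperator<String>> before = new ArrayList<>();
    private final List<UnaryOperator<List<String>>> after = new ArrayList<>();

    ExtensibleSearch(Function<String, List<String>> coreSearch) {
        this.coreSearch = coreSearch;
    }

    /** Registers a hook that may rewrite the query before the search runs. */
    ExtensibleSearch addBefore(UnaryOperator<String> hook) {
        before.add(hook);
        return this;
    }

    /** Registers a hook that may decorate the hits after the search runs. */
    ExtensibleSearch addAfter(UnaryOperator<List<String>> hook) {
        after.add(hook);
        return this;
    }

    /** The client always calls this; hooks run in registration order. */
    List<String> search(String query) {
        for (UnaryOperator<String> hook : before) {
            query = hook.apply(query);
        }
        List<String> hits = coreSearch.apply(query);
        for (UnaryOperator<List<String>> hook : after) {
            hits = hook.apply(hits);
        }
        return hits;
    }
}

final class HookDemo {
    public static void main(String[] args) {
        ExtensibleSearch search = new ExtensibleSearch(q -> Arrays.asList("hit for " + q))
            .addBefore(q -> q.trim().toLowerCase())
            .addAfter(hits -> {
                // Copy first: the incoming list may be unmodifiable.
                List<String> out = new ArrayList<>(hits);
                out.add("clusters: " + hits.size());
                return out;
            });
        System.out.println(search.search("  FOO "));
    }
}
```

The client keeps calling the same search method whether or not any hooks are registered, which is the property being asked for.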

I asked more or less the same question a few days ago but got no replies: https://groups.google.com/d/msg/elasticsearch/VT_nl8Dwu7o/bUDwzsfxU0gJ

Otis



Re: Decorating _search with additional data

joergprante@gmail.com
In reply to this post by Christoph Evers
And there's a great plugin by medcl too:
https://github.com/medcl/elasticsearch-carrot2

Jörg

On 28.06.13 07:38, Christoph Evers wrote:

> I would be interested too.
>
> @Dawid: I just discovered Carrot2, great piece of site/software!


Re: Decorating _search with additional data

joergprante@gmail.com
In reply to this post by Dawid Weiss
I think the challenge is deeper. All actions in the REST API contain
parsing code to bridge between the transport-style encoded form and
Java. Imagine a non-REST API added to ES. It would have to re-implement
all parsing code in the REST action classes.

Making this reusable would mean refactoring all the protocol parsing
code and reducing the REST API classes to a minimum.

Jörg

On 27.06.13 08:27, Dawid Weiss wrote:

> Thanks for the hints, Martijn.
>
>> I think the parseSearchRequest method can be made protected.
> For me it'd have to be public -- I don't want to subclass, I want to
> keep my stuff separate and just delegate parsing of a fragment of my
> request that I know is a search request. It'd be nice if it could be
> made reusable I guess.
>
> Dawid
>
