Greetings!


Jake Mannix
Hi Shay et al.

  Awesome website you've got there - really gets me interested in
trying out this project.  Your documentation gives a great feel for
how ElasticSearch can be used.  Many of my questions can be answered
by digging into the source code, but I was wondering if you could give
a little overview (or point to the relevant docs) on how you do a) the
faceted search part, and b) in what way is the search "real-time".  As
someone who's worked quite a bit on getting distributed faceted real-
time search on top of Lucene to perform well ( see: http://zoie.googlecode.com
and http://bobo-browse.googlecode.com ), I'm interested to see what
ElasticSearch's approach was!

  -jake
  ---
  http://www.linkedin.com/in/jakemannix
  http://www.twitter.com/pbrane

Re: Greetings!

kimchy
Administrator
Hi Jake,

   Thanks for the compliments! I invested quite a bit of time in the site, and it's really great to get positive feedback on it. Regarding your questions:

a) Facets in ES currently only support facet queries. The implementation basically revolves around filters that represent the facet query (cached at the index reader level). It's the most straightforward solution, which I wanted to implement to get something out there (and I think it's the most flexible facet solution of them all). The nice bit about caching is that the index is sharded, so memory is (potentially) not a problem since you can simply fire up more nodes. I plan to add support for other types of facets (and make the search process completely pluggable).
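A minimal sketch of the cached-filter idea, in Python for brevity. `Reader`, `FilterCache`, and `facet_count` are hypothetical names for illustration, not ElasticSearch or Lucene APIs, and the query hits are simply assumed to be a set of doc ids:

```python
from typing import Dict, FrozenSet, Set, Tuple


class Reader:
    """Toy stand-in for an index reader: an immutable snapshot of documents."""

    def __init__(self, docs: Dict[int, Dict[str, str]]) -> None:
        self.docs = docs


class FilterCache:
    """Caches the doc-id set of each facet filter per reader (hypothetical helper)."""

    def __init__(self) -> None:
        self._sets: Dict[Tuple[Reader, str, str], FrozenSet[int]] = {}
        self.misses = 0  # how many times a filter actually had to be evaluated

    def doc_ids(self, reader: Reader, field: str, value: str) -> FrozenSet[int]:
        key = (reader, field, value)  # readers hash by identity
        if key not in self._sets:
            self.misses += 1
            self._sets[key] = frozenset(
                doc_id for doc_id, doc in reader.docs.items() if doc.get(field) == value
            )
        return self._sets[key]


def facet_count(cache: FilterCache, reader: Reader, query_hits: Set[int],
                field: str, value: str) -> int:
    """Facet count = number of query hits that also match the cached filter."""
    return len(query_hits & cache.doc_ids(reader, field, value))


reader = Reader({
    1: {"color": "red"},
    2: {"color": "blue"},
    3: {"color": "red"},
})
cache = FilterCache()
hits_for_stuff = {1, 2}  # pretend these doc ids matched the query "stuff"
red_count = facet_count(cache, reader, hits_for_stuff, "color", "red")
red_again = facet_count(cache, reader, {1, 3}, "color", "red")  # filter served from cache
```

The second call reuses the cached doc-id set for color:red, so only the intersection with the new query's hits is computed.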

b) The search is near real time with the current (only) implementation of the Engine (http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/), which uses Lucene NRT. I got good results with it at a near-real-time factor of 1 second (you see changes at most 1 second after they were indexed), and Lucene 3.1 should be even better. There are other ways to implement real time. One of them, which I did in Compass ages ago, is to have an in-memory index and a "more persistent" index, and do the ops on the in-memory one. NRT might still be used to get the changes from the in-memory index, but it's a smaller index, so the sync points there will potentially be shorter (in terms of time spent). The nice thing about ES is that it has a transaction log, so you don't need to commit in any case, which makes the in-memory index an even better solution since you don't risk losing operations. I planned for such solution(s) upfront with the Engine abstraction I have in ES.
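To illustrate the in-memory plus "more persistent" index idea combined with a transaction log, here is a toy Python sketch. `TwoTierIndex` and its methods are assumptions made up for this example (dicts and a list stand in for Lucene indices and an on-disk log), and deletes are not modeled:

```python
from typing import Dict, List, Tuple


class TwoTierIndex:
    """Toy model: small in-memory tier, larger persistent tier, plus a translog."""

    def __init__(self) -> None:
        self.translog: List[Tuple[str, str]] = []  # every op since the last flush
        self.memory: Dict[str, str] = {}           # recent docs, searchable immediately
        self.persistent: Dict[str, str] = {}       # docs that have been flushed

    def index(self, doc_id: str, text: str) -> None:
        self.translog.append((doc_id, text))  # record the op first so it can be replayed
        self.memory[doc_id] = text

    def search(self, term: str) -> List[str]:
        merged = {**self.persistent, **self.memory}  # in-memory tier wins on conflicts
        return sorted(d for d, text in merged.items() if term in text.split())

    def flush(self) -> None:
        """Move the in-memory tier into the persistent tier and truncate the log."""
        self.persistent.update(self.memory)
        self.memory.clear()
        self.translog.clear()

    @classmethod
    def recover(cls, persistent: Dict[str, str],
                translog: List[Tuple[str, str]]) -> "TwoTierIndex":
        """Rebuild after a crash: the in-memory tier is replayed from the log."""
        index = cls()
        index.persistent = dict(persistent)
        for doc_id, text in translog:
            index.index(doc_id, text)
        return index


idx = TwoTierIndex()
idx.index("1", "red stuff")
visible_before_flush = idx.search("stuff")  # searchable without any commit
idx.flush()
idx.index("2", "blue stuff")
# Simulate losing the in-memory tier and recovering from what was durable.
recovered = TwoTierIndex.recover(idx.persistent, idx.translog)
visible_after_recovery = recovered.search("stuff")
```

The point of the sketch is only that operations which were never flushed survive because the log is replayed into a fresh in-memory tier.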

Hope this answers the question a bit. It's such a broad area...

-shay.banon



Re: Greetings!

Jake Mannix


On Mon, Feb 8, 2010 at 4:12 PM, Shay Banon <[hidden email]> wrote:
Hi Jake,

   Thanks for the compliments! I invested quite a bit of time in the site, and it's really great to get positive feedback on it. Regarding your questions:

a) Facets in ES currently only support facet queries. The implementation basically revolves around filters that represent the facet query (cached at the index reader level). It's the most straightforward solution, which I wanted to implement to get something out there (and I think it's the most flexible facet solution of them all). The nice bit about caching is that the index is sharded, so memory is (potentially) not a problem since you can simply fire up more nodes. I plan to add support for other types of facets (and make the search process completely pluggable).

Filters keyed on indexreader, ok, fairly straightforward (although if you want to do multi-select, this will get tricky: if the user selects "color:red" AND "month:Jan", then you want to filter by both of them for the search results, but also collect the number of hits on the other colors (as long as month:Jan matches), and the number of hits on the other months (as long as the color:red matches), etc...).  

How do you expect the caching to work if you are indexing in real time, though? The DocIdSetIterator is cached per IndexReader, not down at the SegmentReader, right?  Then when you do a reopen/IndexWriter.getReader(), does this stay up to date?
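To make the question concrete, here is a toy Python sketch of caching at the segment level instead of at the top-level reader. It says nothing about what ES actually does; `Segment` and `SegmentFilterCache` are hypothetical names, a "reader" is just a list of segments, and doc ids are global for simplicity (real Lucene doc ids are per segment):

```python
from typing import Dict, FrozenSet, List, Set, Tuple


class Segment:
    """Toy immutable segment: doc id -> field values."""

    def __init__(self, docs: Dict[int, Dict[str, str]]) -> None:
        self.docs = docs


class SegmentFilterCache:
    """Caches filter results per segment, so a reopen only pays for new segments."""

    def __init__(self) -> None:
        self._sets: Dict[Tuple[Segment, str, str], FrozenSet[int]] = {}
        self.evaluated = 0  # number of (segment, filter) pairs actually computed

    def doc_ids(self, segments: List[Segment], field: str, value: str) -> FrozenSet[int]:
        result: Set[int] = set()
        for segment in segments:
            key = (segment, field, value)  # segments hash by identity
            if key not in self._sets:
                self.evaluated += 1
                self._sets[key] = frozenset(
                    d for d, doc in segment.docs.items() if doc.get(field) == value
                )
            result |= self._sets[key]
        return frozenset(result)


cache = SegmentFilterCache()
seg_a = Segment({1: {"color": "red"}, 2: {"color": "blue"}})
reader_v1 = [seg_a]
red_v1 = cache.doc_ids(reader_v1, "color", "red")

# "Reopen": the old segment object is shared, one new segment is added.
seg_b = Segment({3: {"color": "red"}})
reader_v2 = reader_v1 + [seg_b]
red_v2 = cache.doc_ids(reader_v2, "color", "red")
```

With a cache keyed on the whole reader, the reopened reader would be a new key and the filter would be recomputed over every document; here only `seg_b` is evaluated after the reopen.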
 
b) The search is near real time with the current (only) implementation of the Engine (http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/), which uses Lucene NRT. I got good results with it at a near-real-time factor of 1 second (you see changes at most 1 second after they were indexed), and Lucene 3.1 should be even better. There are other ways to implement real time. One of them, which I did in Compass ages ago, is to have an in-memory index and a "more persistent" index, and do the ops on the in-memory one. NRT might still be used to get the changes from the in-memory index, but it's a smaller index, so the sync points there will potentially be shorter (in terms of time spent). The nice thing about ES is that it has a transaction log, so you don't need to commit in any case, which makes the in-memory index an even better solution since you don't risk losing operations. I planned for such solution(s) upfront with the Engine abstraction I have in ES.

Ok, NRT with 1 second turnaround is pretty good, esp since it's distributed (have you done much performance analysis under load?).  If you want to do the partial RAMDir / FSDir thing, you should check out zoie, it's also apache licensed, and takes care of all of that stuff as its core focus (including optimized segment mergers for the realtime case, and a docid<->uid mapping), and works *best* in cases where there is a transaction log.  The indexing paradigm is one of StreamDataProvider / DataConsumer - you hook in your data provider (fed by, eg. your txlog), and zoie provides a DataConsumer which indexes in real time, exposing an IndexReaderFactory which gives you a handle on a List<R extends IndexReader> getReaders() which is real-time up to the couple-of-milliseconds level.  Should be pretty easy to plug in if you wanted to use it.
 

Hope this answers the question a bit. It's such a broad area...


It is, I'm interested to try out ES and see how you got it working!   Very cool stuff!

  -jake

Re: Greetings!

kimchy
Administrator
I see what you mean now. In simple facet usage, each time a facet is clicked, the facet is added as a filter to the query (later on becoming a boolean filter). But in this case, the results you get will always be narrowed down to the query with the filters, so getting facet counts on just the query (stuff) and not the filter (color:red) is not possible.

So, in order to do that, for the facets that go outside of the faceted filtering you will need to execute another count search with the original query, which results in unnecessary calls.

This is a nice scenario, and can be solved quite easily actually by adding to the facet query the ability to override which query it facets on (so some facets will run on the "master" query, which is "stuff", and others will run on the filtered query). This solution is heavily based on the fact that filters are easily cached, so you have the docidsets in memory already.

I can have a look at bobo-browse to see what you are doing; I wouldn't mind trying to use its facet support instead of reimplementing it myself. There are some important ground features that I don't want to lose with facets, and the most important one is being able to define them dynamically (i.e. per request there can be different facets) rather than upfront.

Cheers,
Shay

On Tue, Feb 9, 2010 at 11:28 PM, Jake Mannix <[hidden email]> wrote:


On Tue, Feb 9, 2010 at 12:47 PM, Shay Banon <[hidden email]> wrote:
Filters keyed on indexreader, ok, fairly straightforward (although if you want to do multi-select, this will get tricky: if the user selects "color:red" AND "month:Jan", then you want to filter by both of them for the search results, but also collect the number of hits on the other colors (as long as month:Jan matches), and the number of hits on the other months (as long as the color:red matches), etc...).  

Not sure I understand, you can wrap a query with a filter, and then use that. You will get the count (restricted to the query you ran) of "color:red AND month:Jan". Unless you mean that you want to get counts for color:red and also counts for month:Jan, in this case you simply have two facet queries.

Here's what I mean:  if you are displaying facet information for both color and month, you can let people select from both, so that the results returned are filtered, as you say, by "color:red AND month:Jan", that is great.  But let's look at what pieces of info the user should have:  At first, they have added no facet filters to query "stuff", and we return all matches for "stuff", the total count("stuff") as well as some facet data:

{color : 
  {red : count("stuff AND color:red") }, 
  {blue : count("stuff AND color:blue") }, 
  {green : count("stuff AND color:green") } 
},
{month: 
  {jan : count("stuff AND month:jan") }, 
  {feb : count("stuff AND month:feb") }, 
  {mar : count("stuff AND month:mar") } 
}

Now they click on color:red, and we return all the matches for "stuff AND color:red", along with count("stuff AND color:red"), and facet data:

{color :
  {red : count("stuff AND color:red") /* this link won't be clickable because we're here already */},  
  {blue : count("stuff AND color:blue") /* this link _is_ clickable, and applies the filter "color:blue OR color:red" */},
  {green : count("stuff AND color:green") /* as with color:blue above */}
},
{month :
  {jan : count("stuff AND color:red AND month:jan") },
  {feb : count("stuff AND color:red AND month:feb") },
  {mar : count("stuff AND color:red AND month:mar") }
}

The counts for color *without* red being applied should be returned because we may want to allow users to be able to select a couple of facet values OR'ed together (within a field - filters across fields are AND'ed, as usual).  

Now comes the tricky part: the user clicks on "month:jan", and we return results filtered by "stuff AND color:red AND month:jan", along with count("stuff AND color:red AND month:jan"), and facet data:

{color :
  {red : count("stuff AND color:red AND month:jan") /* this link won't be clickable because we're here already */},  
  {blue : count("stuff AND color:blue AND month:jan") /* _is_ clickable, and switches the filter to "month:jan AND (color:blue OR color:red)" */},
  {green : count("stuff AND color:green AND month:jan") /* as with color:blue above */}
},
{month :
  {jan : count("stuff AND color:red AND month:jan") /* no longer clickable, we're here already */ },
  {feb : count("stuff AND color:red AND month:feb") /* _is_ clickable, and switches the filter to "(month:jan OR month:feb) AND color:red" */ },
  {mar : count("stuff AND color:red AND month:mar") /* similar to month:feb above */ }
}

This is what the user expects from faceted search, in the ui, but I'm pretty sure that the way Solr computes this, is as you say - by executing multiple facet queries, but that is horribly inefficient (esp as the number of fields to facet on grows) - it's much nicer if you can return all of these counts in *one* request, it just requires some work to do it efficiently (this is what we do in bobo-browse).
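The counting rule in the three examples above can be stated as: the counts for a field apply the selections on every *other* field, but ignore the selection on the field itself. A small Python sketch of that rule, where `multi_select_counts` is a made-up name and the input list stands for the documents already matching "stuff":

```python
from typing import Dict, List, Set

Doc = Dict[str, str]


def multi_select_counts(docs: List[Doc],
                        selected: Dict[str, Set[str]],
                        facet_fields: List[str]) -> Dict[str, Dict[str, int]]:
    """Per-field facet counts for multi-select faceting (naive, one pass per field)."""
    counts: Dict[str, Dict[str, int]] = {field: {} for field in facet_fields}
    for field in facet_fields:
        for doc in docs:
            # Apply the selections on all other fields; skip this field's own selection.
            passes_others = all(
                doc.get(other) in values
                for other, values in selected.items()
                if other != field and values
            )
            value = doc.get(field)
            if passes_others and value is not None:
                counts[field][value] = counts[field].get(value, 0) + 1
    return counts


docs_matching_stuff = [
    {"color": "red", "month": "jan"},
    {"color": "red", "month": "feb"},
    {"color": "blue", "month": "jan"},
    {"color": "green", "month": "mar"},
]
counts = multi_select_counts(
    docs_matching_stuff,
    selected={"color": {"red"}, "month": {"jan"}},
    facet_fields=["color", "month"],
)
```

This version walks the hit list once per facet field, which is the inefficiency being described; it only pins down which counts the UI needs.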

  -jake


Re: Greetings!

Jake Mannix

On Wed, Feb 10, 2010 at 3:00 AM, Shay Banon <[hidden email]> wrote:
I see what you mean now. In simple facet usage, each time a facet is clicked, the facet is added as a filter to the query (later on becoming a boolean filter). But in this case, the results you get will always be narrowed down to the query with the filters, so getting facet counts on just the query (stuff) and not the filter (color:red) is not possible.

So, in order to do that, for the facets that go outside of the faceted filtering you will need to execute another count search with the original query, which results in unnecessary calls.

Exactly.  I'm pretty sure this is what Solr does too, and it's not scalable to large numbers of facets.
 
This is a nice scenario, and can be solved quite easily actually by adding to the facet query the ability to override which query it facets on (so some facets will run on the "master" query, which is "stuff", and others will run on the filtered query). This solution is heavily based on the fact that filters are easily cached, so you have the docidsets in memory already.

Even having the docIdSets in memory, it's tricky to do all the counting you need in one traversal of the master query's hit list (well, you don't need to traverse the *whole* thing if you have facets selected on two or more fields, but still).  It's not rocket science, but yeah, you need to keep track of the largest set of docIds which could contribute to a count as you walk. If you've got color:red and date:feb both selected, then you need to walk the docs which match (stuff AND (color:red OR date:feb)), and on each doc, determine whether it matches (color:red AND date:feb), so it's an actual hit to be collected, or only one of them. In the latter case, if it matches color:red but date:jan instead of date:feb, you need to *not* do a real "collect()", but you do want to increment the counter for date:jan.
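A rough Python sketch of that single-pass walk, under the assumption that each doc has at most one value per field. `collect_with_facets` is a hypothetical name and this is not bobo-browse's actual implementation; the input list stands for the docs matching "stuff":

```python
from typing import Dict, List, Set, Tuple

Doc = Dict[str, str]


def collect_with_facets(docs: List[Doc],
                        selected: Dict[str, Set[str]],
                        facet_fields: List[str]
                        ) -> Tuple[List[int], Dict[str, Dict[str, int]]]:
    """One pass over the master query's hits, producing hits and facet counts."""
    hits: List[int] = []
    counts: Dict[str, Dict[str, int]] = {field: {} for field in facet_fields}
    constrained = [field for field, values in selected.items() if values]

    for doc_id, doc in enumerate(docs):
        failed = [f for f in constrained if doc.get(f) not in selected[f]]
        if len(failed) > 1:
            continue  # fails two or more selections: cannot contribute to any count
        if not failed:
            hits.append(doc_id)      # matches every selection: a real hit
            to_count = facet_fields  # and it counts toward every facet field
        else:
            to_count = failed        # near miss: counts only for the field it failed
        for field in to_count:
            value = doc.get(field)
            if field in counts and value is not None:
                counts[field][value] = counts[field].get(value, 0) + 1
    return hits, counts


docs_matching_stuff = [
    {"color": "red", "date": "feb"},   # real hit
    {"color": "red", "date": "jan"},   # fails date only -> bumps date:jan
    {"color": "blue", "date": "feb"},  # fails color only -> bumps color:blue
    {"color": "blue", "date": "jan"},  # fails both -> skipped
]
hits, counts = collect_with_facets(
    docs_matching_stuff,
    selected={"color": {"red"}, "date": {"feb"}},
    facet_fields=["color", "date"],
)
```

The skipped document is the one outside (color:red OR date:feb), which is why the walk can be restricted to that union when two fields have selections.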
 
I can have a look at bobo-browse to see what you are doing; I wouldn't mind trying to use its facet support instead of reimplementing it myself. There are some important ground features that I don't want to lose with facets, and the most important one is being able to define them dynamically (i.e. per request there can be different facets) rather than upfront.

Dynamic facets in bobo-browse are built out of what we call a RuntimeFacetHandler, which can be built on top of other FacetHandlers. A dynamic facet could, for example, facet on the intersection with a generic query (QueryWrapperFilter).  It won't be as efficient as a static facet field, because it needs to set itself up at query time (caching would help, of course) instead of at IndexReader load time. Static facets do the latter: a full forward lookup of the facet values for the static fields is loaded in the background at load time, so that everything is available in memory at query time.

Is that the kind of feature you wanted to make sure was there, or is it something else you were referring to?

  -jake


Re: Greetings!

kimchy
Administrator


On Thu, Feb 11, 2010 at 12:43 AM, Jake Mannix <[hidden email]> wrote:

Is that the kind of feature you wanted to make sure was there, or is it something else you were referring to?

Yep, that's what I was talking about. How embeddable is Bobo browse (and is there a chance to get Spring out of it :) )?
 




Re: Greetings!

Jake Mannix


On Wed, Feb 10, 2010 at 2:47 PM, Shay Banon <[hidden email]> wrote:
Is that the kind of feature you wanted to make sure was there, or is it something else you were referring to?

Yep, that's what I was talking about. How embeddable is Bobo browse (and is there a chance to get Spring out of it :) )?

Excellent.  Bobo is easily embeddable - it's what it's for!  Spring is a completely optional dependency, you can instantiate your FacetHandlerFactories directly in code, or contribute a patch which gets Guice in there (we'd love to give our users that option too!).  Spring was just for convenience, and because many of us use it (and it was all there was 3 years ago!).

I guess I could be wrong: Spring might be required at build time, but you don't need to use it... we should fix that, because it's not integral in any way.

  -jake

Re: Greetings!

kimchy
Administrator


On Thu, Feb 11, 2010 at 12:53 AM, Jake Mannix <[hidden email]> wrote:


Excellent.  Bobo is easily embeddable - it's what it's for!  Spring is a completely optional dependency, you can instantiate your FacetHandlerFactories directly in code, or contribute a patch which gets Guice in there (we'd love to give our users that option too!).  Spring was just for convenience, and because many of us use it (and it was all there was 3 years ago!).

I guess I could be wrong: Spring might be required at build time, but you don't need to use it... we should fix that, because it's not integral in any way.

Cool. I will have a look at it and see how it goes once I get some other major features that I want to add to 0.5.0 out of the way :)
 
