Is there a "common" benchmark for Solr, ElasticSearch and Sensei?


Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Paolo Castagna
Is there a fair and unbiased comparison in terms of features and
performance between Solr [1], ElasticSearch [2] and Sensei [3]?

Is anyone interested in helping (just advice on what would
need to be done is fine!) to construct a "common" benchmark, perhaps
using JMeter, which would allow people to easily and quickly compare
the performance of Solr, ElasticSearch and Sensei on the same hardware?

More specifically, I am looking for help and advice on:

  - what dataset(s), publicly available, I could use
  - what fields/schema (tokenizers, analyzers, etc.) I should use
  - a common set of queries
  - help with tools (JMeter?, something else?)
  - help with configuration (in particular with Sensei, since
    it's the one I am least familiar with)
  - ...

More general questions:

Is there a simple and pragmatic benchmark for "information retrieval"
systems (please, don't point me at TREC, see: simple and pragmatic)?

Since all these projects use Lucene, could the Lucene benchmark
contrib [4] be used/adapted to test Solr, ElasticSearch and Sensei?

Sorry for the cross-posting, but in this case I think it's appropriate.

Thanks,
Paolo

  [1] http://lucene.apache.org/solr/
  [2] http://www.elasticsearch.com/
  [3] http://sna-projects.com/sensei/
  [4]
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/benchmark/

Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Lukáš Vlček
Hi,

I do not know anything about Sensei, but given how fast ElasticSearch and Solr are changing and developing, I would not expect there to be any easy answer to this question. Also, I am not sure Solr and ElasticSearch are using the exact same version of Lucene at any given time (and this can be quite important). And performance of the search server should not be the only criterion (what about deployment and maintenance of the server!). I think it is very hard to produce simple, fair and general metrics to compare all three search servers. A feature matrix can make sense, but again, I think it is not all just about features.

If you need to find the best solution for your project, then I would recommend doing some evaluation yourself, i.e.: take your data (or a sample of it), index it, put the expected load on your search server, try moving the index from one server to another (or a similar emulation of production use cases), try restoring the index from the source data (emulating a crash) ... etc. ... and then you can easily find what best suits your needs.

Regards,
Lukas


Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Paolo Castagna
Hi Lukáš,
first of all, thanks for your reply.

Lukáš Vlček wrote:
> I do not know anything about Sensei, but given how fast ElasticSearch
> and Solr are changing and developing, I would not expect there to be
> any easy answer to this question. Also, I am not sure Solr and
> ElasticSearch are using the exact same version of Lucene at any given
> time (and this can be quite important).

If there were an easy answer, I wouldn't have asked for advice or help. :-)


According to Solr's pom.xml [1], the latest-stable release of Solr
(v1.4) uses Lucene v2.9.1.

According to ElasticSearch's pom.xml [2], the latest release of
ElasticSearch (v0.6.0) uses Lucene v3.0.1.

According to Sensei's ivy.xml [3], the development version of
Sensei uses Lucene v3.0.0.


The fact that these projects are using different (sometimes
only slightly different) versions of Lucene is important in
terms of "fairness" for a comparison, but it does not diminish
the value for users of having a "common", easy way to benchmark
these projects.

Are there big differences, in terms of performance, between
Lucene v2.9.1 and Lucene v3.0.1?

Users could also use the Lucene benchmark contrib to compare
different versions of Lucene and therefore have an idea of
the effect that this might have on the benchmark for Solr,
ElasticSearch and Sensei.

Finally, if the benchmark is easy to use, we could use it to
benchmark latest/stable releases as well as development
trunks/versions.


  [1]
http://repo1.maven.org/maven2/org/apache/solr/solr-core/1.4.0/solr-core-1.4.0.pom
  [2]
http://oss.sonatype.org/content/repositories/releases/org/elasticsearch/elasticsearch/0.6.0/elasticsearch-0.6.0.pom
  [3] http://github.com/javasoze/sensei/blob/master/ivy.xml


> And performance of the search server should not be the only criterion
> (what about deployment and maintenance of the server!). I think it is
> very hard to produce simple, fair and general metrics to compare all
> three search servers. A feature matrix can make sense, but again, I
> think it is not all just about features.

I didn't claim performance should be the only or main criterion.

But I think it's valuable for users to be able to easily compare
performance, if they want/need to. I'd like to be able to do so.

What columns would you put on a feature matrix?

> If you need to find the best solution for your project, then I would
> recommend doing some evaluation yourself, i.e.: take your data (or a
> sample of it), index it, put the expected load on your search server,
> try moving the index from one server to another (or a similar emulation
> of production use cases), try restoring the index from the source data
> (emulating a crash) ... etc. ... and then you can easily find what best
> suits your needs.

Everybody needs to find the best solution for their own project,
and everybody will use different criteria.

It would be good to have a common/shareable/public dataset that others
can (re)use to run the same benchmark and, if they want, compare
different software on the same hardware or different hardware using
the same software.

Would email archives from Apache Software Foundation or W3C be a good
dataset? I could use Tika for parsing mbox files.
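To make it a bit more concrete, below is a minimal sketch (in Java) of the
kind of mbox parsing I have in mind, using Tika's AutoDetectParser. The
class name and the idea of mapping the extracted text/metadata to fields
are only suggestions, not a final schema; and depending on the Tika
version it might be necessary to first split the mbox on the standard
"From " separator lines, so that each message becomes one document.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class MboxParsingSketch {

    public static void main(String[] args) throws Exception {
        File mbox = new File(args[0]);
        InputStream in = new FileInputStream(mbox);
        try {
            AutoDetectParser parser = new AutoDetectParser();
            // -1 removes the default limit on the amount of extracted text
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            parser.parse(in, handler, metadata);

            // The extracted metadata and body text would then be mapped to
            // fields (e.g. subject, from, date, body) and sent to each
            // search server under test.
            System.out.println("title: " + metadata.get(Metadata.TITLE));
            System.out.println("body characters: " + handler.toString().length());
        } finally {
            in.close();
        }
    }
}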

I don't disagree with anything you wrote, I still think that providing
people an easy way to run the same benchmark over Solr, ElasticSearch
and Sensei is valuable.

So, I am still looking for advice, suggestions and help.

Thanks again for your reply,
Paolo


Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Sergio Bossa
On Thu, Apr 15, 2010 at 11:15 AM, Paolo Castagna
<[hidden email]> wrote:

> It would be good to have a common/shareable/public dataset that others
> can (re)use to run the same benchmark and, if they want, compare
> different software on the same hardware or different hardware using
> the same software.

Paolo, you may want to take a look at the recently publicly released
Yahoo Firehose:
http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html

Hope that helps,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Paolo Castagna
Sergio Bossa wrote:

> On Thu, Apr 15, 2010 at 11:15 AM, Paolo Castagna
> <[hidden email]> wrote:
>
>> It would be good to have a common/shareable/public dataset that others
>> can (re)use to run the same benchmark and, if they want, compare
>> different software on the same hardware or different hardware using
>> the same software.
>
> Paolo, you may want to take a look at the recently publicly released
> Yahoo Firehose:
> http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html

Thank you Sergio, it's interesting, but I don't see how I could use
that service to build a dataset that I could then use for a benchmark.

I am not even sure the "Terms of Use" would allow me to download
a large chunk of it and/or re-publish it, as it is or in a different
format, somewhere else.

Ideally, we would need a stable and re-publishable dataset.

In a previous email, I proposed the email archives from the Apache
Software Foundation, simply because they are available in mbox format
and should be freely usable and re-publishable.

I'll wait for other ideas/suggestions before deciding on the dataset.

Once we have a dataset, we can discuss/agree on
fields/schemas/tokenizers/analyzers etc., then queries,
then tools to generate the load and drive the benchmark,
then the actual operations we want to perform/benchmark.
I was thinking of these broad categories, initially: searches with a
small number of results, searches with a large number of results,
bulk/batch indexing into an empty index, and index updates.
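To give an idea of the kind of driver I have in mind for the two search
categories, here is a rough sketch; it assumes the servers are queried
over HTTP, the base URL and the query file are just placeholders, and
bulk indexing/updates would need a separate writer. A tool like JMeter
could of course replace this entirely.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class QueryBenchmarkSketch {

    public static void main(String[] args) throws Exception {
        String baseUrl = args[0];    // e.g. "http://localhost:8983/solr/select?q=" (placeholder)
        String queryFile = args[1];  // one query per line
        List<Long> latencies = new ArrayList<Long>();

        BufferedReader queries = new BufferedReader(
                new InputStreamReader(new FileInputStream(queryFile), "UTF-8"));
        String query;
        while ((query = queries.readLine()) != null) {
            long start = System.nanoTime();
            URL url = new URL(baseUrl + URLEncoder.encode(query, "UTF-8"));
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            InputStream in = conn.getInputStream();
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // drain the response, so the full result set is transferred
            }
            in.close();
            latencies.add((System.nanoTime() - start) / 1000000L);
        }
        queries.close();

        if (latencies.isEmpty()) {
            System.out.println("no queries executed");
            return;
        }
        Collections.sort(latencies);
        System.out.println("queries:   " + latencies.size());
        System.out.println("median ms: " + latencies.get(latencies.size() / 2));
        System.out.println("p95 ms:    " + latencies.get((int) (latencies.size() * 0.95)));
    }
}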

> Hope that helps,

Thanks for your suggestion,
Paolo


Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Lukáš Vlček


On Thu, Apr 15, 2010 at 11:15 AM, Paolo Castagna <[hidden email]> wrote:
> Are there big differences, in terms of performance, between
> Lucene v2.9.1 and Lucene v3.0.1?

I am not that familiar with Lucene internals, but there can be performance gains, as well as new features that previously had to be implemented in the search server layer.
http://lucene.apache.org/java/3_0_0/changes/Changes.html
http://lucene.apache.org/java/3_0_1/changes/Changes.html
I just wanted to point out that you need to take this into account if you want to compare apples with apples.

> Users could also use the Lucene benchmark contrib to compare
> different versions of Lucene and therefore have an idea of
> the effect that this might have on the benchmark for Solr,
> ElasticSearch and Sensei.
>
> Finally, if the benchmark is easy to use, we could use it to
> benchmark latest/stable releases as well as development
> trunks/versions.

I would welcome such a benchmark, and I believe search server developers would welcome it as well. Indexing the Apache mailing lists is not a bad idea, I think.

> I didn't claim performance should be the only or main criterion.
>
> But I think it's valuable for users to be able to easily compare
> performance, if they want/need to. I'd like to be able to do so.
>
> What columns would you put on a feature matrix?

Hmm... thinking about it more, it seems to me that search servers (especially those based on Lucene) will all converge to the same basic feature set. What will differentiate them is how the distributed nature of some features is implemented, and the amount of work one has to spend on maintenance, updates, fixes and development ... and of course, when it comes to "enterprise business", whether you can buy reliable support for it.

> Would email archives from Apache Software Foundation or W3C be a good
> dataset? I could use Tika for parsing mbox files.
>
> I don't disagree with anything you wrote, I still think that providing
> people an easy way to run the same benchmark over Solr, ElasticSearch
> and Sensei is valuable.

I agree.

How would you like to implement those benchmarks? A new project somewhere on Google Code or GitHub, containing prebuilt instances of the individual search servers, a copy of the benchmark dataset (mbox data, if taken from Apache) and a wiki page with results contributed by volunteers testing this on various OS and HW configurations?


Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Paolo Castagna
Lukáš Vlček wrote:
> I am not that familiar with Lucene internals, but there can be
> performance gains, as well as new features that previously had to be
> implemented in the search server layer.
> http://lucene.apache.org/java/3_0_0/changes/Changes.html
> http://lucene.apache.org/java/3_0_1/changes/Changes.html
> I just wanted to point out that you need to take this into account if
> you want to compare apples with apples.

I agree.

Hopefully, with time Solr, ElasticSearch and Sensei will use the
same/latest stable Lucene release.

Upgrading Lucene from v2.9.x to v3.0.x is relatively expensive (since
deprecated APIs have been removed), but once a project is on v3.0.x it
should not be that difficult to keep upgrading.

>     Users could also use the Lucene benchmark contrib to compare
>     different versions of Lucene and therefore have an idea of
>     the effect that this might have on the benchmark for Solr,
>     ElasticSearch and Sensei.
>
>     Finally, if the benchmark is easy to use, we could use it to
>     benchmark latest/stable releases as well as development
>     trunks/versions.
>
>
> I would welcome such a benchmark, and I believe search server developers
> would welcome it as well. Indexing the Apache mailing lists is not a bad
> idea, I think.

Good.

>     What columns would you put on a feature matrix?
>
>
> Hmm... thinking about it more, it seems to me that search servers
> (especially those based on Lucene) will all converge to the same basic
> feature set. What will differentiate them is how the distributed nature
> of some features is implemented, and the amount of work one has to spend
> on maintenance, updates, fixes and development ... and of course, when it
> comes to "enterprise business", whether you can buy reliable support for it.

So, if I need to extract some column names from what you are saying...

  - basic features
     - ...
  - distribution/clustering
  - maintenance cost
  - upgrade/update cost
  - commercial support
  - ...

> How would you like to implement those benchmarks? A new project
> somewhere on Google Code or GitHub, containing prebuilt instances of the
> individual search servers, a copy of the benchmark dataset (mbox data, if
> taken from Apache) and a wiki page with results contributed by volunteers
> testing this on various OS and HW configurations?

Infrastructure (e.g. Google Code or GitHub) is not a problem.

The benchmark should be easy to use. So, ideally, a user should be able
to check out the benchmark with Solr, ElasticSearch and Sensei already
configured and ready to run.

The benchmark must include a copy of the dataset.

Ideally, the benchmark should produce reports in the same format so
that people can easily share and compare results.
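For example (just a suggestion for the shape of it, not a fixed format),
one CSV line per run that people could paste into a shared wiki page,
something like:

  server,version,dataset,operation,docs,nodes,throughput_ops_per_sec,median_ms,p95_ms
  <server>,<version>,<dataset>,<operation>,<docs>,<nodes>,<throughput>,<median>,<p95>

The column names are only placeholders; the important thing is that
every run reports the same columns.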

Paolo

Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Lukáš Vlček


On Thu, Apr 15, 2010 at 4:18 PM, Paolo Castagna <[hidden email]> wrote:
> The benchmark should be easy to use. So, ideally, a user should be able
> to check out the benchmark with Solr, ElasticSearch and Sensei already
> configured and ready to run.

Running the benchmark on just a single machine is probably not that useful. What would be more important is the ability to run the benchmark on several nodes in parallel (like distributed search). I am not sure how hard it would be to deliver easy-to-deploy artefacts for the purpose of a distributed benchmark. Also, the benchmark should report as much information about the network as possible.


Re: Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Paolo Castagna
Lukáš Vlček wrote:
>     The benchmark should be easy to use. So, ideally, a user should be
>     able to check out the benchmark with Solr, ElasticSearch and Sensei
>     already configured and ready to run.
>
>
> Running the benchmark on just a single machine is probably not that useful.

Agree.

What I usually do is have a sort of "template" for the distribution
to put on each node of my cluster, and then separate configuration to
set up nodes differently if necessary.

> What would be more important is the ability to run the benchmark on
> several nodes in parallel (like distributed search). I am not sure how
> hard it would be to deliver easy-to-deploy artefacts for the purpose of
> a distributed benchmark.

Agree. Not trivial, but possible.

If there are Puppet gurus around, we could even automate deployment and
configuration using Puppet. Or perhaps just simple rsync scripts are
fine to start with.

> Also, the benchmark should report as much information about the network as possible.

True, often (in particular when results are cached in RAM and/or disks
are not used) network transfer is a significant part of the response
time.

But, although important, I would leave monitoring network traffic out
for the moment, for simplicity. Having a benchmark to run is the first
step.

Paolo





Please consider also HBase/Cassandra+Lucene

Thomas Koch
Hi,

if you want to build a benchmark for search solutions over big data, you may
also want to consider the following:

http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend

Lucandra uses Cassandra as the persistence layer for Lucene. The following two
attempts have ported Lucandra to use HBase:

http://github.com/thkoch2001/lucehbase
http://github.com/akkumar/hbasene

Both of the above are only proofs of concept for now, but may quickly become
production ready. In the end you only need something like 5 classes to glue
Lucene to Cassandra or HBase.

Best regards,

Thomas Koch, http://www.koch.ro

Re: Please consider also HBase/Cassandra+Lucene

timrobertson100
+1 from me.  Very curious how distributing the storage behind
Lucene performs, compared to distributing multiple Lucene indexes
themselves.



Re: Please consider also HBase/Cassandra+Lucene

kimchy
This solution is problematic IMO in how it works, and especially in how it interacts with how Lucene works. With this solution, there is only a single Lucene IndexWriter that you can open, so your writes don't scale, regardless of the number of machines you add.

Also, Lucene caches a *lot* of information per reader/searcher (fieldcache, terms info, and so on). With large indices, you have a single reader working against a very large cluster/index, and your client won't cope with it.

You can't get around it: you need to shard a Lucene index into many small Lucene indices running on different machines, but then you need to write a distributed Lucene solution. And hey, I think someone already built one :)

(p.s. I am not even mentioning all the many other features elasticsearch gives over this very low-level Lucene solution.)

Shay



Re: Please consider also HBase/Cassandra+Lucene

timrobertson100
My curiosity was really whether I could open many read-only index readers
to scale the reads, I guess.




Re: Please consider also HBase/Cassandra+Lucene

kimchy
This will work up to the point where your index gets big. In that case, your search JVM might not be able to hold all the information needed (for example, it won't be able to load the term info and field cache since they are too big, and you will get either OOM or GC thrashing).

cheers,
shay.banon



Re: Please consider also HBase/Cassandra+Lucene

timrobertson100
Thanks Shay.  Curiosity squashed




Re: Please consider also HBase/Cassandra+Lucene

kimchy
Hope elasticsearch helps in the area of the squashing :). To be honest, I followed a similar path way back, when I tried to build a distributed Lucene Directory on top of GigaSpaces/Coherence/Terracotta. From a design perspective, it's not very different from what Lucandra or the HBase ports do, and in the end they suffer from the same limitations that I noted...

cheers,
shay.banon



Re: Please consider also HBase/Cassandra+Lucene

Ori Lahav
Thanks Shay for this point in this thread.

Given all the threads around benchmarking: is there some kind of test that you have done, or that someone else has done, using ElasticSearch, with numbers to share regarding scale and performance?

Things I would like to get a sense of are:
1. number of documents in the index.
2. how many instances on how many machines.
3. at this size, how many writes/s
4. how many reads/s

If you have a link to a previously run benchmark it would be appreciated - or you can share your own numbers.

Thanks
Ori


--
http://olahav.typepad.com

Re: Please consider also HBase/Cassandra+Lucene

kimchy
Benchmark numbers are very subjective, from where you run the benchmark to how you run it. If someone would like to create an unbiased benchmark, I would be happy to help. As for your specific use case, I suggest you write your own test and check. If you want, in the elasticsearch repo there are some jmeter scripts that I use for testing.

cheers,
shay.banon



Re: Please consider also HBase/Cassandra+Lucene

Paolo Castagna
Shay Banon wrote:
> If someone would like to create an unbiased benchmark, I would be happy to help.

Would you suggest using JMeter as the tool for benchmarking?

Today I was looking at the Lucene benchmark contrib; have you ever used
it? I was wondering if I could adapt it to ElasticSearch and Solr, or
whether it's better to start with JMeter.

With JMeter, I am not sure it can produce useful reports that people can
then share/exchange.
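What I mean is something small like the sketch below, which post-processes
a JMeter result file into a few numbers people could share. It assumes the
results were saved in CSV format with a header row; the "elapsed" and
"success" column names are what I believe JMeter writes by default, but
that is an assumption to double check, and the comma split is naive.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class JMeterCsvSummarySketch {

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        List<String> header = Arrays.asList(reader.readLine().split(","));
        int elapsedCol = header.indexOf("elapsed");  // response time in ms (assumed column name)
        int successCol = header.indexOf("success");  // "true"/"false" (assumed column name)

        List<Long> elapsed = new ArrayList<Long>();
        int failures = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // naive CSV split: good enough for a sketch, not for fields containing commas
            String[] fields = line.split(",");
            if (!Boolean.parseBoolean(fields[successCol])) {
                failures++;
            }
            elapsed.add(Long.parseLong(fields[elapsedCol]));
        }
        reader.close();

        if (elapsed.isEmpty()) {
            System.out.println("no samples found");
            return;
        }
        Collections.sort(elapsed);
        System.out.println("samples:   " + elapsed.size());
        System.out.println("failures:  " + failures);
        System.out.println("median ms: " + elapsed.get(elapsed.size() / 2));
        System.out.println("p95 ms:    " + elapsed.get((int) (elapsed.size() * 0.95)));
    }
}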

Thanks,
Paolo

Re: Please consider also HBase/Cassandra+Lucene

kimchy
If you plan to use the HTTP interface of both products, then jmeter can be the way to go.

cheers,
shay.banon

