In the tradition of unscientific benchmarks :)


mooky

Just thought I'd share some numbers - take them with the pinch of salt
all "benchmarks" should be taken with.
By no means is this a comprehensive or scientific benchmark, nor was it
performed in ideal conditions (Windows machine, running McAfee ...
ouch).

Very briefly, the use-case I have is a bit weird: I have small (tiny)
documents that I get by the 100s-1000s per second - and I get them 1
at a time, in 1 thread. I was mostly interested in how quickly I could
store/index them, with enough throughput to handle the average
documents/second plus some headroom for the bursts, which might be
10-100 times the average. The 'real-time'-ness of the application is
also pretty loose. So long as stuff gets indexed within a few seconds,
it's good enough.

Here is what a typical document looks like:
{
  "_ric" : "0005.HK",
  "BID" : 0.0,
  "ASK" : 0.0,
  "BEST_BSIZ1" : -0.0,
  "BEST_ASIZ1" : -0.0,
  "_timestamp" : "2010-03-20T01:38:54.217Z"
}

Solr was the first thing I experimented with. If it was going to be
fast enough, and I didn't have to write much code, it would be a good
first port of call. Ultimately, I knew that if I had to, I could do my
own buffering & batching ... but I didn't really want to write that if
I didn't have to. At first Solr was not good enough; it wouldn't keep
up. But I turned to the mailing list and got some helpful advice.
After making some changes, I got it fast enough that I didn't have to
write any buffering/batching crap. Result!

This afternoon I had a bit of time, so I thought I would give
elasticsearch a spin and see how some numbers compared:

Roughly, here are the results (each run indexed 50,000 documents):

Configuration              Time taken  Docs/second
CommonsHttpSolrServer        145646ms          343
EmbeddedSolrServer            11000ms         4545  (some as high as 6000)
ElasticSearch                  4688ms        10665  (some as high as 15000)
StreamingUpdateSolrServer      1031ms        48496
ElasticSearchAsync              390ms       128205
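
For anyone checking the arithmetic, the per-second figures are just the document count divided by the elapsed time, truncated to a whole number. A minimal sketch (the class and method names are made up for illustration, they are not from the benchmark code):

```java
// Hypothetical helper, not part of the original benchmark code.
final class Throughput {
    private Throughput() {}

    /** Whole documents per second, truncated (integer division). */
    static long docsPerSecond(long docs, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return docs * 1000L / elapsedMillis;
    }

    public static void main(String[] args) {
        long docs = 50_000;
        // Elapsed times reported in the results, in milliseconds.
        long[] elapsedMillis = {145_646, 11_000, 4_688, 1_031, 390};
        for (long ms : elapsedMillis) {
            System.out.println(ms + "ms -> " + docsPerSecond(docs, ms) + " per second");
        }
    }
}
```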

CommonsHttpSolrServer is the standard client interface with Solr.
Given I was indexing 1 document at a time, and it was an HTTP call per
document, it's no surprise the performance was poor.

By the way, it's worth noting that for all these configurations
(except embedded), I was running Solr in the same JVM - and accessing
it via http://localhost.
Also worth noting: I was using autocommit - I was happy to make use
of the buffering inside Solr.
        <autoCommit>
          <maxDocs>10000</maxDocs>
          <maxTime>3000</maxTime>
        </autoCommit>
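
A rough model of what that config asks for, assuming the usual reading of these two settings: a commit happens once maxDocs documents are pending, or once maxTime milliseconds have passed since the oldest uncommitted document, whichever comes first. The class below is a hypothetical illustration, not Solr code:

```java
// Hypothetical model of the autoCommit rule; this is not Solr's implementation.
final class AutoCommitPolicy {
    private final int maxDocs;
    private final long maxTimeMillis;

    AutoCommitPolicy(int maxDocs, long maxTimeMillis) {
        this.maxDocs = maxDocs;
        this.maxTimeMillis = maxTimeMillis;
    }

    /** True once either threshold is reached while documents are pending. */
    boolean shouldCommit(int pendingDocs, long millisSinceOldestPending) {
        if (pendingDocs <= 0) {
            return false; // nothing to commit
        }
        return pendingDocs >= maxDocs || millisSinceOldestPending >= maxTimeMillis;
    }
}
```

With the values above (10000 docs, 3000 ms), a steady trickle of documents is committed at least every 3 seconds, and a burst is committed every 10,000 documents.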


EmbeddedSolrServer, as the name suggests, was using the embedded API.
No HTTP hop.

ElasticSearchSync (listed as "ElasticSearch" in the results) was using
elasticsearch in a synchronous fashion. The elasticsearch APIs are all
asynchronous: they return a result object that you can then wait on.
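
To make the sync/async distinction concrete, here is a sketch using only java.util.concurrent. It does not use the elasticsearch client at all: fakeIndex is a made-up stand-in for any call that returns a result object straight away. The synchronous style waits on every result before sending the next document; the asynchronous style submits everything and waits once at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: fakeIndex stands in for an asynchronous index call.
final class SyncVsAsync {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Hypothetical async operation: returns immediately with a result handle. */
    CompletableFuture<Integer> fakeIndex(int docId) {
        return CompletableFuture.supplyAsync(() -> docId, pool);
    }

    /** Synchronous style: block on each result before sending the next document. */
    int indexSync(int count) {
        int acknowledged = 0;
        for (int i = 0; i < count; i++) {
            fakeIndex(i).join();
            acknowledged++;
        }
        return acknowledged;
    }

    /** Asynchronous style: submit everything, then wait once at the end. */
    int indexAsync(int count) {
        List<CompletableFuture<Integer>> pending = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            pending.add(fakeIndex(i));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return pending.size();
    }

    void shutdown() {
        pool.shutdown();
    }
}
```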

StreamingUpdateSolrServer is similar to CommonsHttpSolrServer, except
it has a client-side buffer and it submits to the server in batches.
It does exactly what I didn't want to spend time writing :). Also, the
buffer size was 50,000 - the same size as my test run. So the test run
only really measures how quickly documents can be created and stuffed
into the buffer.
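
For illustration, client-side buffering and batching of this kind can be sketched roughly as below. This is a hypothetical, simplified class, not how StreamingUpdateSolrServer is implemented: it flushes on the caller's thread, and the sender callback stands in for whatever submits a batch to the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical class for illustration; not the SolrJ implementation.
final class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender; // e.g. one request per batch
    private List<T> buffer = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Adds one document; sends a batch once batchSize documents have accumulated. */
    synchronized void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is currently buffered, if anything. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = buffer;
        buffer = new ArrayList<>();
        sender.accept(batch);
    }
}
```

In a sketch like this, a batch size equal to the whole run (50,000) means nothing is sent until the very last add, which is the same reason the timing above mostly measures filling the buffer.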

ElasticSearchAsync was using elasticsearch in an asynchronous way.
This is the default out of the box. Quick!

The other thing worth noting is that my input data in my test was
actually a map.
In the Solr case, I am turning that map into JSON (using Jackson) and
stuffing it into the document (stored, not indexed) so I can return
that JSON as the result when queried later.
Interestingly enough, that's pretty much what elasticsearch does under
the hood.

For what it's worth, here is my code for setting up the server:
        Server server = serverBuilder()
                .settings(settingsBuilder()
                        .put("monitor.memory.alpha.translogNumberOfOperationsThreshold", 10000)
                        .put("index.engine.robin", 10, TimeUnit.SECONDS)
                        .put("threadpool.type", "dynamic") // "scaling" in next release
                        .put("node.local", true)
                )
                .server();

        client = server.client();
That's it. No config files, nothing. Nice.


And indexing my data:
            JsonBuilder<BinaryJsonBuilder> builder = jsonBuilder()
                    .startObject()
                    .field(RIC, ric);

            for (Map.Entry<String, Object> entry : data.entrySet()) {
                builder.field(entry.getKey(), entry.getValue());
            }

            client.index(
                    indexRequest(TICKINDEX)
                            .type("tick")
//                            .operationThreaded(true) // this will make it completely async
                            .operationThreaded(false)
                            .source(builder.endObject())
            );

(FWIW, it's quite a bit more code for Solr to do the indexing. A
160-line class, in fact)

So, there you have it. It seems quite quick (in this scenario at
least) and quite nice to use. I'm eagerly awaiting it to go 1.0 :)