batch submission?


batch submission?

Colin Surprenant
Hi,

In the Riak mailing list
(http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-April/000927.html),
Eric Gaumer made the excellent suggestion of adding a batch
submission endpoint in ES to avoid the HTTP overhead when submitting a
very large number of documents.

Is this something that could be easily added in ES? What do you think?

Thanks,
Colin

Re: batch submission?

kimchy
Administrator
Batch submission can be added, but first, note that it will not be transactional (that is, it will not guarantee that either all operations succeed or all fail). Also, instead of using batch submission, you can either multithread or make each operation asynchronous. You should get results very similar to batching when you do.

cheers,
shay.banon
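
As a rough illustration of the multithreaded alternative described above, the following Python sketch submits each document as its own operation from a small thread pool. The `send` callable is a placeholder for whatever performs one indexing request (it is not an elasticsearch API), and the default worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def index_concurrently(docs, send, workers=20):
    """Submit each document as its own operation, `workers` at a time.

    `send` is a hypothetical callable that performs one indexing request
    and returns its result; it is an assumption of this sketch.
    Returns a list of (doc, ok, detail) tuples in the original order.
    """
    def attempt(doc):
        try:
            return (doc, True, send(doc))
        except Exception as exc:  # one failure must not stop the others
            return (doc, False, exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even though calls run concurrently
        return list(pool.map(attempt, docs))
```

Each document succeeds or fails on its own here, which matches the non-transactional behaviour described for batching.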



Re: batch submission?

egaumer
On Fri, Apr 16, 2010 at 11:38 AM, Shay Banon <[hidden email]> wrote:
> Batch submission can be added, but first, note that batch submission will not be a transactional one (either all succeed or fail). Also, instead of using batch submission, you can either multithread or async each actual operation you want to do. You should get very similar result as batching when you do it.

In a multithreaded situation (async or otherwise), you still have to deal with message-passing overhead (network latency, etc.). I would argue that passing batches of 1000 documents in a single thread would still be faster than spawning 1000 threads that each submit a single document. Am I wrong? Maybe at small batch sizes they are pretty equal, but what about as the batch size increases?

I guess I'm mainly focused on the HTTP interface and the overhead associated with this type of messaging. Batching seems like a reasonable way to reduce latency in this particular area but could very well create bottlenecks elsewhere (e.g., index writing).

Even still, if multithreading is an option, then wouldn't sending batches across each of those threads be more efficient than sending one document at a time?

So assume I have 100 million documents of 3K each and I *need* to use HTTP. I plan on using 20 threads per node on a 3-node feeding cluster (BTW, this is, without a doubt, a common scenario in enterprise search deployments).

Being able to send a batch of a few hundred documents across each connection is going to save me a lot of HTTP calls. No?

I think the transaction semantics are reasonable. If I send a batch of 200 documents, I would expect the batch to fail or succeed as one unit; otherwise it's much harder for me to resubmit. This is generally how some of the commercial vendors do it.

Regards,
-Eric
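
To put numbers on the scenario above, here is a small Python sketch that counts HTTP requests with and without batching. The document count and thread layout come from the message; the batch size of 200 is only an assumed stand-in for "a few hundred", and the sketch counts requests, not elapsed time.

```python
TOTAL_DOCS = 100_000_000   # from the scenario above
THREADS = 20 * 3           # 20 threads per node on a 3-node feeding cluster


def request_counts(total_docs, batch_size, threads):
    """Return (total HTTP requests, requests per thread), both rounded up."""
    total_requests = (total_docs + batch_size - 1) // batch_size
    per_thread = (total_requests + threads - 1) // threads
    return total_requests, per_thread


for batch_size in (1, 200):  # 200 is an assumed "few hundred" batch size
    total, per_thread = request_counts(TOTAL_DOCS, batch_size, THREADS)
    print(f"batch={batch_size}: {total:,} requests, ~{per_thread:,} per thread")
```

With a batch size of 200, the request count drops from 100,000,000 to 500,000. Whether that translates into higher indexing throughput depends on where the real bottleneck is, which is the open question in this thread.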


Re: batch submission?

kimchy
Administrator
Batching will certainly increase the number of documents you can index. If you use HTTP with keep-alive, the overhead of sending one document at a time should not be that high. But, of course, it depends on a lot of factors. In Java, the HTTP aspect (the headers and such) does not add a lot of overhead compared to the latency of the rest of the request if you do it right, but I am not sure how much overhead HTTP has in Ruby and other languages...

I will add batching, and people can play with it and see if they can get better performance.

Regarding whether the whole batch succeeds or fails as a unit: I was saying that elasticsearch will *not* support this. If you do batching, the request will hit several shards, and elasticsearch will not do two-phase commit across potentially many resources (shards), especially since two-phase commit is, by itself, broken when it comes to many resources (but that's a different story). The API will simply return a status for each element in the batch, i.e., whether it worked or not.

cheers,
shay.banon
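
Since no batch API exists at this point in the thread, the following Python sketch only illustrates the idea of per-element status: the caller pairs each submitted document with its status and resubmits only the failures. The `submit_batch` callable and its return shape (a list of booleans, one per document, in order) are assumptions made for illustration, not the eventual elasticsearch response format.

```python
def submit_with_retry(docs, submit_batch, max_attempts=3):
    """Submit docs as a batch, resubmitting only the elements that failed.

    `submit_batch` is a hypothetical callable: it takes a list of documents
    and returns a list of booleans (True = indexed), one per document, in
    the same order. Returns the documents that still failed after
    `max_attempts` rounds.
    """
    pending = list(docs)
    for _ in range(max_attempts):
        if not pending:
            break
        statuses = submit_batch(pending)
        if len(statuses) != len(pending):
            raise ValueError("expected one status per submitted document")
        # Keep only the documents whose status reports a failure.
        pending = [doc for doc, ok in zip(pending, statuses) if not ok]
    return pending
```

Because each element carries its own status, a partial failure does not require resending the whole batch, which addresses the resubmission concern without all-or-nothing semantics.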




Re: batch submission?

egaumer
On Fri, Apr 16, 2010 at 12:32 PM, Shay Banon <[hidden email]> wrote:

> Regarding the all will fail or not. I was saying that elasticsearch will *not* support this. If you do batching, the request will hit several shards and elasticsearch will not do two phase commit across potentially many resources (shards), especially since, by itself, two phase commit is broken (but thats a different story) when it comes to many resources. The API will simply return a status for each element in the batch, i.e., if it worked or not.

Ahh... got ya. I think as long as people understand the limitations, it's not ideal but okay. I think this functionality would be used mainly to bootstrap an index with some pre-existing data. Once that process is complete, you'd typically switch to an incremental or near real-time feed anyway (at least that's what I'd suggest).

-Eric


Re: batch submission?

kimchy
Administrator
Sounds great! Want to open an issue for the batch thingy?

cheers,
shay.banon




Re: batch submission?

egaumer
On Fri, Apr 16, 2010 at 12:53 PM, Shay Banon <[hidden email]> wrote:
> Sounds great!. Want to open an issue for the batch thingy?

http://github.com/elasticsearch/elasticsearch/issues#issue/138

-Eric


Re: batch submission?

Colin Surprenant
Very nice. Thanks. Will definitely run some performance tests when it's
available.

I agree with Eric that the typical use case for this would be to
bootstrap an index with some pre-existing data. This is how I plan to
use it.

Having to parse the results to check the status for each element works
for me.

Colin

On Apr 16, 2:04 pm, Eric Gaumer <[hidden email]> wrote:
> On Fri, Apr 16, 2010 at 12:53 PM, Shay Banon
> <[hidden email]>wrote:
>
> > Sounds great!. Want to open an issue for the batch thingy?
>
> http://github.com/elasticsearch/elasticsearch/issues#issue/138
>
> -Eric