
Looking for advice on bulk loading

Looking for advice on bulk loading

Rob Styles
Hi all,

I'm looking for some advice on getting lots of docs into es.

We are generating es docs from a db and each doc is around 1k at the moment. We expect to increase the size of each doc to around 4k as we include more data. Our initial set is 250M docs.

The db is distributed and we're using hadoop's map/reduce to generate the json docs. We could write them out to disk and load, though at the moment we've been trying to post them directly into es. We're using the java client api and bulk requests of 20 docs each at the moment. We have capacity to run around 160 doc generators concurrently, and doing that knocks es over quite nicely. We have a 6 node cluster and each node is running both our converters and es.

Is anyone doing anything similar who could suggest approaches to loading the data into es in the most efficient way?

thanks

rob


Re: Looking for advice on bulk loading

Drew Raines-2
Rob Styles wrote:

> We are generated es docs from a db and each doc is around 1k at the
> moment.  We expect to increase the size of each doc to around 4k as
> we include more data. Our initial set is 250M docs.

[...]

> We have capacity to run around 160 doc generators concurrently, and
> doing that knocks es over quite nicely.
>
> We have a 6 node cluster and each node is running both our
> converters and es.

First of all, what kind of document index rate do you get before it
falls over?  It may be that you're currently maximizing perf and you
would need to add nodes.

Some scattershot questions to try and find a possible issue:

What exactly do you mean by knocking ES over?  Do you get OOMEs?  Do
your nodes become unresponsive?  Is it only a few nodes that have
trouble?

How much RAM do the machines have and what do you have set for your
ES heap settings?

Are you indexing into a single index or multiple?  What's the output
of `es shards` from https://github.com/elasticsearch/es2unix?

If you back off the number of generators, at what number does the
cluster experience problems?

-Drew


Re: Looking for advice on bulk loading

joergprante@gmail.com
In reply to this post by Rob Styles
Hi Rob,

I am not sure what your doc generators look like, but I understand you send from 160 threads (or JVMs?), each one using a (Transport)Client, with a capacity of 160 x 20 = 3200 docs in flight simultaneously? Or is it more? Do you connect to all 6 nodes (TransportClient sniff mode enabled)? 3200 docs per second should be possible; it depends on the fields and analyzers, of course.

I understand your setup so that the Hadoop converter, the ES (Transport)Client instance, and ES itself share a single machine? Or even a single JVM? I would suggest spreading the ingest load to other machines with dedicated TransportClients that connect remotely to the cluster. That way, preparing the JSON with the Hadoop converters (or whatever) and compressing it for the wire is decoupled from the ES cluster's main task, the Lucene-based indexing. And, with the (gigabit) network in between, you can more easily measure the throughput you can push into the cluster.
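
Just to illustrate, a minimal sketch of such a dedicated remote client with sniffing enabled (0.90-era Java API; the cluster name and host are placeholders for your environment):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class RemoteIngestClient {
        // Cluster name and host are placeholders; adjust them for your setup.
        public static Client create() {
            Settings settings = ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "elasticsearch")   // must match the cluster name
                    .put("client.transport.sniff", true)    // discover the other nodes automatically
                    .build();
            return new TransportClient(settings)
                    .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300));
        }
    }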

It would be cool to get more info on how your bulk requests are managed. Did you look at org.elasticsearch.action.bulk.BulkProcessor? It demonstrates how to throttle bulk indexing with the help of a semaphore around the BulkResponses; a sketch follows below. Note, there are also thread pools and Netty workers that can be reconfigured for large bulks on multicore machines. With 160 simultaneous threads there is probably not much room for queueing, but it may help if the clients wait for outstanding bulk requests when a concurrency limit is exceeded, or pause when a specific response time limit is exceeded.
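
As a rough sketch of what that can look like (builder methods as exposed by the BulkProcessor class; the index and type names are placeholders):

    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.Client;

    // Flushes a bulk request every 1000 docs and allows at most 4 bulk requests
    // in flight; add() throttles once that limit is reached.
    static void bulkIndex(Client client, Iterable<String> jsonDocs) {
        BulkProcessor bulk = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            public void beforeBulk(long id, BulkRequest request) { }
            public void afterBulk(long id, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println(response.buildFailureMessage());  // collect failed docs for retry
                }
            }
            public void afterBulk(long id, BulkRequest request, Throwable failure) {
                failure.printStackTrace();  // e.g. node unreachable; back off before retrying
            }
        }).setBulkActions(1000).setConcurrentRequests(4).build();

        for (String json : jsonDocs) {
            bulk.add(new IndexRequest("myindex", "mytype").source(json));  // placeholder index/type
        }
        bulk.close();
    }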

Best regards,

Jörg

On 15.02.13 14:50, Rob Styles wrote:

> Hi all,
>
> I'm looking for some advice on getting lots of docs into es.
>
> We are generated es docs from a db and each doc is around 1k at the
> moment. We expect to increase the size of each doc to around 4k as we
> include more data. Our initial set is 250M docs.
>
> The db is distributed and we're using hadoop's map/reduce to generate
> the json docs. We could write them out to disk and load, though at the
> moment we've been trying to post them direct into es. We're using the
> java client api and bulk requests with 20 docs each at the moment. We
> have capacity to run around 160 doc generators concurrently, and doing
> that knocks es over quite nicely. We have a 6 node cluster and each
> node is running both our converters and es.
>
> Is anyone doing anything similar and would be able to suggest
> approaches to loading the data into es in the most efficient way?
>


Re: Looking for advice on bulk loading

Jon Shea
In reply to this post by Rob Styles
We do pretty much the same thing you do. We backfill either by running a script that traverses a database table and sends bulk index requests to our Elasticsearch cluster over the Java API, or by running a MapReduce job on our Hadoop cluster that sends bulk index requests to Elasticsearch over the Java API. Either way, we put a `sleep` in between bulk insert requests, which we can adjust dynamically to slow down the backfill if our Elasticsearch cluster is getting overloaded. You’ll have to experiment to figure out how quickly you can push the backfill. I recommend putting in a `sleep` that you can adjust while the backfill is running, to make it easier to figure out where your limits are. I also recommend catching index errors or exceptions and waiting for a generous period of time (several seconds) before retrying; a rough sketch follows below.
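
Roughly, the loop looks something like this (`nextBatch` and the index/type names are hypothetical stand-ins, and the sleep values are just examples):

    import java.util.List;
    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.client.Client;

    static void backfill(Client client, long sleepMillis) throws InterruptedException {
        while (true) {
            List<String> batch = nextBatch();  // hypothetical helper: next chunk of JSON docs from the db
            if (batch.isEmpty()) return;

            BulkRequestBuilder bulk = client.prepareBulk();
            for (String json : batch) {
                bulk.add(client.prepareIndex("myindex", "mytype").setSource(json));  // placeholder index/type
            }
            try {
                if (bulk.execute().actionGet().hasFailures()) {
                    Thread.sleep(5000);  // back off generously; real code would retry the failed docs
                }
            } catch (Exception e) {
                Thread.sleep(5000);      // cluster overloaded or node down; back off before continuing
            }
            Thread.sleep(sleepMillis);   // adjustable throttle between bulk requests
        }
    }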

In the blog post announcing the Index Update Settings API (http://www.elasticsearch.org/blog/2011/03/23/update-settings.html) Shay recommended setting `"index.refresh_interval": -1` and `"index.merge.policy.merge_factor": 30` at the start of the backfill, and then restoring them to their defaults after the backfill. There is also this gist (https://gist.github.com/duydo/2427158) that purports to be advice from Shay, but some of the advice is confusing or contradictory. I don’t understand how `index.translog` relates to `index.refresh_interval`, for example. I also don’t understand why the Thrift API would be much better, since it still requires a serialized JSON representation of the document. There’s not much else you can do, as far as I know.
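
For example, toggling those two settings around a backfill with the Java admin client might look roughly like this (the index name is a placeholder; 1s and 10 are the documented defaults):

    import org.elasticsearch.common.settings.ImmutableSettings;

    // Before the backfill: disable refresh and relax merging. `client` is an existing Client.
    client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                    .put("index.refresh_interval", "-1")
                    .put("index.merge.policy.merge_factor", 30)
                    .build())
            .execute().actionGet();

    // ... run the backfill ...

    // After the backfill: restore the defaults.
    client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                    .put("index.refresh_interval", "1s")
                    .put("index.merge.policy.merge_factor", 10)
                    .build())
            .execute().actionGet();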

Ideally, we’d love to be able to do a MapReduce that wrote Lucene / Elasticsearch index files to disk on our Hadoop cluster outside of Elasticsearch. And we’d like to be able to deploy these indexes by doing something like scp'ing them into place on our Elasticsearch cluster. But we haven’t yet invested the resources to figure out exactly how to make that work.

-Jon


Re: Looking for advice on bulk loading

joergprante@gmail.com
You are referring to hints about speeding up indexing. In most cases you can gain about 10-20% more efficiency with them, so the hints are for situations where you want to squeeze the most out of existing resources. But to set up bulk indexing in normal situations you don't always need such tweaking; you can get very far with the default ES settings.

The idea behind tweaking the ES settings is as follows: while long baseline bulk loads are running, admins are willing to give up search, and realtime search in particular, in exchange for some extra percentage of performance on the Lucene indexing side. Setting the refresh interval to -1 disables realtime search, so IndexReader/IndexWriter switches and read I/O are reduced, and write I/O can run at higher throughput. The merge policy factor may be increased to give Lucene's expensive segment merging more room.

The translog settings become interesting when you observe how many IOPS (input/output operations per second) indexing needs. The idea is to reduce the number of IOPS and thereby the stress on the disk subsystem. Disk I/O is by far the slowest part of the system; for example, if ES indexing or the translog flushes to disk too often, indexing speed will suffer badly. Changing the translog flush settings is one method (a sketch follows below), but replacing slow disks with faster disks or SSDs (or loads of RAM) will gain far more efficiency.
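
A sketch of what relaxing the translog flush triggers can look like (setting names from the index modules documentation of that era; the values and index name are only illustrative):

    import org.elasticsearch.common.settings.ImmutableSettings;

    // Flush the translog less often: only after 50,000 operations or 500 MB,
    // whichever comes first. `client` is an existing Client instance.
    client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                    .put("index.translog.flush_threshold_ops", 50000)
                    .put("index.translog.flush_threshold_size", "500mb")
                    .build())
            .execute().actionGet();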

Thrift protocol does not use JSON, it is an alternative to HTTP JSON. It
uses a compact binary protocol for object transports and reduces the
protocol overhead significantly. The serialization and deserialization
is faster than in HTTP. ES offers an optional Thrift plugin. For more
about Thrift, see http://jnb.ociweb.com/jnb/jnbJun2009.html

Jörg

On 16.02.13 02:29, Jon Shea wrote:

> In the blog post announcing the Index Update Settings API
> (http://www.elasticsearch.org/blog/2011/03/23/update-settings.html)
> Shay recommended setting `"index.refresh_interval": -1` and
> `"index.merge.policy.merge_factor": 30` at the start of the backfill,
> and then restoring them to their defaults after the backfill. There is
> also this gist (https://gist.github.com/duydo/2427158) that purports
> to be advice from Shay, but some of the advice is confusing or
> contradictory. I don’t understand how `index.translog` relates to
> `index.refresh_interval`, for example. I also don’t understand why the
> Thrift API would be much better, since it still requires a serialized
> JSON representation of the document. There’s not much else you can do,
> as far as I know.
>
> Ideally, we’d love to be able to do a MapReduce that wrote Lucene /
> Elasticsearch index files to disk on our Hadoop cluster outside of
> Elasticsearch. And we’d like to be able to deploy these indexes by
> doing something like scp'ing them into place on our Elasticsearch
> cluster. But we haven’t yet invested the resources to figure out
> exactly how to make that work.


Re: Looking for advice on bulk loading

Jon Shea
Jörg,

I completely agree with your indexing advice. To summarize for Rob:

1) You’re doing it pretty much how everyone does it.
2) If you’re taking down your cluster, you need to slow down the rate you index. Use a configurable `sleep`, or fewer Hadoop machines.
3) If you want to index documents faster, the best thing you can do is run more shards on more Elasticsearch nodes (one shard per node is optimal). It might also help to use better hardware (like SSDs), but I haven’t profiled that.
4) The Elasticsearch defaults for indexing are pretty good, but you might be able to tweak them to get tens of percent improvements. See the links in my original post.

I remain a little skeptical about the Thrift API, though. I’ve looked at it several times, and it’s really more of an HTTP-over-Thrift API than a first-class Thrift API. The Thrift struct for API requests has a binary blob `body` (https://github.com/elasticsearch/elasticsearch-transport-thrift/blob/master/elasticsearch.thrift#L23). I haven’t bothered to fully trace the code, and the usage isn’t thoroughly documented, but I presume that to use the Thrift API you form the JSON for an API request, serialize it to UTF-8, and then put it in the `body` field of a Thrift `RestRequest`. Please correct me if I’m wrong about that. Parsing a Thrift API request might be somewhat less work for Elasticsearch than parsing an HTTP API request, but parsing the `body` contents of a Thrift API request is going to be the same as parsing the body contents of an HTTP request. I haven’t profiled this, but I’d be surprised if Elasticsearch was spending a ton of time parsing HTTP overhead.

Regardless, thanks for your advice on indexing. I really appreciate it.

-Jon
