Bulk indexing tips

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Bulk indexing tips

Ivan Brusic
Been experimenting with various settings to speed up bulk loading of
30 million medium sized documents to a 2 (for now) node cluster. The
eventual goal is to periodically recreate the entire index to a new
one, while preserving search on the current index via an alias.

First pass was a simple single bulk indexer called via multiple worker
threads, which was adequate but far from ideal. I then switched to
each worker thread having its own bulk indexer. Eventually each worker
swamped ES with too many indexing requests.

If I understand correctly, calling execute() with an ActionListener is
executed asynchronously, while execute().actionGet() is synchronous
(blocking)? It seems that I was starting too many bulk indexing
threads with the asynchronous call that took too long to execute.

Next was TranspontClient versus Node, but there seems was little
difference. However, it seems that all index requests were made to a
single server and not round-robin. Are searches only round-robined or
should index requests be as well? Would it make sense to direct all
index requests to a single server anyways?

The next step is to experiment with indexes settings such as refresh
and the translog. For the various translog settings, which setting has
the highest priority, the one that occurs first or last? For example
if the threshold size hits the limit before the number of operations.
Is a bulk index of 10 documents 1 operation or 10?  Is the
index.gateway.local.sync setting still used? I do not see any
references to it in the code.

Cheers,

Ivan
Reply | Threaded
Open this post in threaded view
|

Re: Bulk indexing tips

Karussell
What is your setup (no. of shards/indices)? How many items do you
include in one bulk index request?

As one tuning option you could increase lucenes merge-segment number
to 20 or 30 (and switch back to 10 afterwards).
Also the refresh interval for real time latency can be increased to 10
seconds or more or even disable it (-1) to get faster indexing time:

http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html

> If I understand correctly, calling execute() with an ActionListener is
> executed asynchronously, while execute().actionGet() is synchronous
> (blocking)?

Yes

> Is a bulk index of 10 documents 1 operation or 10?

What do you mean here? They will send as a chunk to the server but
every document will be added via luceneWriter.updateDocument()

Peter.
Reply | Threaded
Open this post in threaded view
|

Re: Bulk indexing tips

Ivan Brusic
On Tue, Dec 13, 2011 at 2:39 AM, Karussell <[hidden email]> wrote:
> What is your setup (no. of shards/indices)? How many items do you
> include in one bulk index request?

Currently using the default of 5 shards and 1 replica for only 1
index. The idea is to first get all data loaded in order to benchmark
various queries. After which we will decide the number of nodes and
shards. The 30 million document index is our main concern, but there
will be other indices in the system as well. The bulk index request
contains 50 documents, which was set somewhat arbitrarily.

> As one tuning option you could increase lucenes merge-segment number
> to 20 or 30 (and switch back to 10 afterwards).
> Also the refresh interval for real time latency can be increased to 10
> seconds or more or even disable it (-1) to get faster indexing time:
>
> http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html

I knew about the refresh interval (although I did not use it), but did
not think about the number of merge segments. Is there a direct method
to setting the refresh interval in the Java API, or is it simply a
setting?

client.admin().indices().updateSettings(...)

> What do you mean here? They will send as a chunk to the server but
> every document will be added via luceneWriter.updateDocument()

The translog setting is in terms of "operations" and I was no sure at
what level the definition of operation is. Sounds like the single bulk
index request of N documents will be N operations.

Ivan
Reply | Threaded
Open this post in threaded view
|

Re: Bulk indexing tips

Ivan Brusic
In reply to this post by Karussell
On Tue, Dec 13, 2011 at 2:39 AM, Karussell <[hidden email]> wrote:

> Also the refresh interval for real time latency can be increased to 10
> seconds or more or even disable it (-1) to get faster indexing time:

Forgot to add: what would the difference be between setting the
refresh interval to -1 per index versus setting the refresh flag per
request?

indexRequest.refresh(false)

--
Ivan
Reply | Threaded
Open this post in threaded view
|

Re: Bulk indexing tips

kimchy
Administrator
Heya,

  First, here are some pointers to improve indexing performance:

1. One of the simplest and effective ones is to simply start with a index with no replicas. And once indexing is done, increase the number of replicas to the number you want to have. This will reduce the load when indexing.

2. The refresh interval setting can help. Setting it to -1 (using the update setting API) means that the async interval of refreshing the index will not happen (you can still explicitly call refresh though). The refresh flag on the index request defaults to false, and its there to force refresh after an index request has executed. It does not relate to what you do.

3. The translog gets flushed once one of the different thresholds is met. I don't think you need to play with it too much (cause large Lucene commits can actually stall indexing since they are blocking).

4. The default merge policy (tiered - http://www.elasticsearch.org/guide/reference/index-modules/merge.html) is pretty good for bulk indexing as well. You can play with a higher value for segments_per_tier (defaults to 10). These settings can dynamically be set using the update settings API as well.

On Tue, Dec 13, 2011 at 7:54 PM, Ivan Brusic <[hidden email]> wrote:
On Tue, Dec 13, 2011 at 2:39 AM, Karussell <[hidden email]> wrote:

> Also the refresh interval for real time latency can be increased to 10
> seconds or more or even disable it (-1) to get faster indexing time:

Forgot to add: what would the difference be between setting the
refresh interval to -1 per index versus setting the refresh flag per
request?

indexRequest.refresh(false)

--
Ivan

Reply | Threaded
Open this post in threaded view
|

Re: Bulk indexing tips

Ivan Brusic
On Fri, Dec 16, 2011 at 7:20 AM, Shay Banon <[hidden email]> wrote:

>   First, here are some pointers to improve indexing performance:

Thanks for the response. I've been gone from the list from a couple of
months, but I have slowly scanned most of what I have missed.
Unbelievable amount of great new features in that short amount of
time.

> 1. One of the simplest and effective ones is to simply start with a index
> with no replicas. And once indexing is done, increase the number of replicas
> to the number you want to have. This will reduce the load when indexing.

I just played around with setting the number of replicas. The workflow
should be:
1. create a new index with number_of_replicas=0
2. bulk index
3. set number_of_replicas = n (TBD)
4. move search alias to new index

On my current 20GB test index (2 nodes, 5 shards), setting
number_of_replicas to 0 will split the shards between the two nodes.
Then setting number_of_replicas to 1 will cause the shards to be
replicated, pushing out 20GBs of data between the two machines. My
questions are:

For step 2, with no replicas, should bulk indexing occur only on 1
node or should it round-robin between all nodes? Seems like the former
is the obvious choice, but checking if I am missing out on some
benefit by choosing the latter.

Is the new index searchable, at an acceptable performance level, just
after setting number_of_replicas to something other than 0? For
example, if I set number_of_replicas to 2 on a 4 node cluster, would
it be safe to use the index, or should I wait for the cluster state to
return to green? Is there a listener for cluster state notifications?
Would prefer not to have to busy wait on the cluster health.

--
Ivan
Reply | Threaded
Open this post in threaded view
|

Re: Bulk indexing tips

kimchy
Administrator
Answers below:

On Fri, Dec 16, 2011 at 10:12 PM, Ivan Brusic <[hidden email]> wrote:
On Fri, Dec 16, 2011 at 7:20 AM, Shay Banon <[hidden email]> wrote:

>   First, here are some pointers to improve indexing performance:

Thanks for the response. I've been gone from the list from a couple of
months, but I have slowly scanned most of what I have missed.
Unbelievable amount of great new features in that short amount of
time.

cheers!
 

> 1. One of the simplest and effective ones is to simply start with a index
> with no replicas. And once indexing is done, increase the number of replicas
> to the number you want to have. This will reduce the load when indexing.

I just played around with setting the number of replicas. The workflow
should be:
1. create a new index with number_of_replicas=0
2. bulk index
3. set number_of_replicas = n (TBD)
4. move search alias to new index

Sounds good.
 

On my current 20GB test index (2 nodes, 5 shards), setting
number_of_replicas to 0 will split the shards between the two nodes.
Then setting number_of_replicas to 1 will cause the shards to be
replicated, pushing out 20GBs of data between the two machines.

If you know you are going to index to 2 nodes, then I would use 4 or 6 shards so they will be even between the nodes and each node will do balanced share of indexing.
 
My
questions are:

For step 2, with no replicas, should bulk indexing occur only on 1
node or should it round-robin between all nodes? Seems like the former
is the obvious choice, but checking if I am missing out on some
benefit by choosing the latter.

You are using the Java client right? If you use the NodeClient, it will send the respective "shard bulks" directly to the shards, if you use the transport client, i will round robin to a node, and then that node will break the bulk to respective "shard bulks" and send them.
 

Is the new index searchable, at an acceptable performance level, just
after setting number_of_replicas to something other than 0? For
example, if I set number_of_replicas to 2 on a 4 node cluster, would
it be safe to use the index, or should I wait for the cluster state to
return to green? Is there a listener for cluster state notifications?
Would prefer not to have to busy wait on the cluster health.

Yes, the index is searchable. There will be a cost of allocation those replicas and moving data to them. You can reduce that using some settings described here: http://www.elasticsearch.org/guide/reference/modules/cluster.html. For example, setting: node_concurrent_recoveries to 1, and indices.recovery.concurrent_streams to 1.
 

--
Ivan