Indexing multiple things at once. Possible?

Indexing multiple things at once. Possible?

elasticsearcher
I've searched around in the docs and haven't found a solution, so I thought I'd ask here.

In my program, I generate many short documents to index very quickly (shall we say, 1000 every few seconds, per thread, and I have many threads on many nodes), and then insert them into ElasticSearch for indexing one-by-one until they're gone. I believe this may be a bottleneck in my system.

Is there any way to index a large batch of documents at once (all of the same type)?

I am currently using the REST API via python, but if this feature exists in a different API instead, it is conceivable that I could incorporate it into my program.

My document type looks like:

{
    Name1:<string>
    Name2:<string>
    Percent:<int>
}

I'm imagining the slowdown is simply because I have to push thousands of documents to the cloud, one-by-one, even though I have large chunks of them generated at once, and the overhead of individual transfers/indexing is the bottleneck.
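For concreteness, the one-request-per-document pattern described above might look roughly like this; the index and type names are made up, and the actual HTTP call is left out:

```python
import json

# Hedged sketch of one index request per document. Each call like this
# costs a full HTTP round-trip to the cluster, which is the suspected
# per-document overhead. "myindex"/"mytype" are hypothetical names.
def index_request(doc, index="myindex", doc_type="mytype"):
    """Build the (method, path, body) for a single index call."""
    return ("POST", "/%s/%s" % (index, doc_type), json.dumps(doc))

method, path, body = index_request({"Name1": "a", "Name2": "b", "Percent": 42})
```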
Re: Indexing multiple things at once. Possible?

kimchy
Administrator
It's important to understand where the bottleneck is. When you say you index documents "into" the cloud, what do you mean? Is that a WAN call?



Re: Indexing multiple things at once. Possible?

Mahendra M
In reply to this post by elasticsearcher
Hi,

I also had a similar requirement. I don't know if this solution will work for you, but you can try an alternate approach.

Instead of indexing the documents directly, push them onto a message queue (like RabbitMQ).

Have consumers that keep reading from the queue and index each document into Elasticsearch.

By decoupling document generation from document indexing this way, you don't need to worry about the rate at which your documents are being created.

Also, since your documents seem to be small, they will not be much of an overhead on the messaging system.

If you use a framework like Celery, this is done very transparently for you; you don't have to understand AMQP and similar technologies deeply.

Assuming that you are doing this on a cloud setup, you may already have access to a RabbitMQ setup.
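As a rough illustration of the decoupling idea, here is a minimal sketch using Python's standard-library queue in place of a real broker like RabbitMQ; the index call is a stub, not an actual Elasticsearch client:

```python
import queue
import threading

# Stand-in for a RabbitMQ queue; in practice the producer and the
# consumer would be separate processes talking to the broker.
doc_queue = queue.Queue()
indexed = []

def index_document(doc):
    # Placeholder: the real consumer would issue an HTTP index
    # request to Elasticsearch here.
    indexed.append(doc)

def consumer():
    while True:
        doc = doc_queue.get()
        if doc is None:          # sentinel: shut the worker down
            break
        index_document(doc)
        doc_queue.task_done()

worker = threading.Thread(target=consumer)
worker.start()

# Producer side: document generation just enqueues and moves on,
# regardless of how fast indexing keeps up.
for i in range(1000):
    doc_queue.put({"Name1": "a", "Name2": "b", "Percent": i % 100})

doc_queue.put(None)
worker.join()
```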

Regards,
Mahendra

http://twitter.com/mahendra




Re: Indexing multiple things at once. Possible?

Mahendra M
Hi,

One more tip, along the same lines:

You can try out - http://wiki.github.com/jbrisbin/rabbitmq-webhooks/

This can automate the job of listening for messages and indexing to
ElasticSearch.

However, please note that rabbitmq-webhooks is in very early stages of
development (and, as documented, is known to be nasty to RabbitMQ as
of now).

Regards,
Mahendra




--
Mahendra

http://twitter.com/mahendra

Re: Indexing multiple things at once. Possible?

elasticsearcher
In reply to this post by kimchy
In essence, I have a large number of documents that are generated in
large quantities very quickly, and I'm looking for a way to index
them as fast as possible. I was wondering if there is a way to index,
say, a batch of documents more quickly than indexing each document
individually.

If this isn't feasible, would it be possible for me to split off a
few threads to help send indexing requests to Elasticsearch more
quickly? My question is really aimed at understanding how
Elasticsearch deals with indexing requests. Would it be advantageous
to issue as many indexing requests as possible, or would Elasticsearch
start to get overloaded? (Or, at what point would Elasticsearch get
overloaded?)

For instance, my cluster is currently running on five regular old PCs
(servers coming soon), each with 2GB RAM (1GB allocated to ES), a
dual-core Intel CPU, etc. My program, running on each node, generates
lists of, shall we say for simplicity, 1000 documents, essentially as
fast as it can. After generating the 1000 documents, it currently sits
there and submits the documents one-by-one to Elasticsearch for
indexing until the documents are all gone. It then generates a new
1000 documents and repeats the process. My program already has
multiple threads which could all be generating sets of 1000 documents
at once, maybe 3000-4000 documents queued up at any time, on each node.

Since I'm not very familiar with how ES actually does the indexing,
I'm really just looking for advice on how to get my large number of
documents indexed as quickly as possible.


Re: Indexing multiple things at once. Possible?

kimchy
Administrator
You should certainly use several threads / processes (/ machines) to index data. When you index data, it gets redirected to the appropriate shard and then replicated to its replica shards. Usually, when it comes to indexing, you can monitor the CPU first and IO later; if they get maxed out, you are over-indexing.
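A hedged sketch of the multi-threaded approach suggested here: fan a batch out over several indexing threads, where index_one() is a stand-in for the real per-document HTTP call.

```python
import threading

# Collected results guarded by a lock; the real call would be I/O
# against Elasticsearch, not a list append.
results = []
lock = threading.Lock()

def index_one(doc):
    with lock:
        results.append(doc)

def index_batch(docs, num_threads=4):
    """Split docs round-robin across num_threads indexing threads."""
    chunks = [docs[i::num_threads] for i in range(num_threads)]

    def worker(chunk):
        for d in chunk:
            index_one(d)

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

index_batch([{"Percent": i} for i in range(1000)])
```

Since each thread spends most of its time waiting on the network, even a handful of threads can overlap many in-flight requests.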

-shay.banon



Re: Indexing multiple things at once. Possible?

Berkay Mollamustafaoglu-2
In reply to this post by elasticsearcher
If you use the async (non-blocking) interface, you can index really fast even if you're sending the docs one by one; a batch process is not really needed.
If you'll have 5 servers, I'd guess that 3-4K documents would not be an issue; ES would easily keep up with that. We're able to index 1000 docs with 100 fields each on a single 4-core CPU PC.

You can watch CPU/IO to throttle if necessary. Also, you may want to use the blocking threadpool: http://www.elasticsearch.com/docs/elasticsearch/modules/threadpool/blocking/
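One way to get this non-blocking behavior from Python is a thread pool: hand each index request to the pool and collect futures instead of waiting for every document in turn. This is a sketch under the assumption that index_one() wraps the real HTTP call; here it is a placeholder that just echoes a field.

```python
from concurrent.futures import ThreadPoolExecutor

def index_one(doc):
    # Placeholder for the real index request; the real call would
    # return the Elasticsearch acknowledgement.
    return doc["Percent"]

def index_async(docs, max_workers=8):
    """Submit all docs without blocking per-document, then gather results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(index_one, d) for d in docs]
    return [f.result() for f in futures]

acks = index_async([{"Percent": i} for i in range(100)])
```

max_workers is a knob you would tune while watching CPU/IO, per the throttling advice above.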


Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

