Poor performance updating

haarts
Dear list,

We were stoked when we found out about the updating feature in the recent 0.19.0rc2 release. We have been eagerly experimenting with it but are disappointed by its performance. Hopefully you can tell us we are doing something wrong.

We roughly use this model: https://gist.github.com/1751349. Starting from a clean index it takes 7 seconds to index 1000 documents (ok-ish). After indexing 3 million documents performance degrades to 30 seconds per 1000 documents (prohibitively slow). We expect to insert 500 million documents plus 4 million a day.

Our approach to inserting documents is as follows:
We first try to update a document; if that returns an error, we create it instead.
The resulting documents can contain hundreds and possibly thousands of 'interactions', growing the document size to about 3 MB.
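
A minimal sketch of the update-then-create flow described above. The `InMemoryStore` class, its `update`/`create` methods, the `DocumentMissing` error and the `content` field are hypothetical stand-ins for illustration only; they are not the Elasticsearch client API, and the real model is the one in the linked Gist.

```python
from typing import Dict, List


class DocumentMissing(Exception):
    """Raised by the stand-in store when the document to update does not exist."""


class InMemoryStore:
    """Hypothetical stand-in for the index, not an Elasticsearch client."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, List[dict]]] = {}

    def update(self, doc_id: str, interaction: dict) -> None:
        # Fails when the document is absent, mirroring "update returns an error".
        if doc_id not in self.docs:
            raise DocumentMissing(doc_id)
        self.docs[doc_id]["interactions"].append(interaction)

    def create(self, doc_id: str, interaction: dict) -> None:
        self.docs[doc_id] = {"interactions": [interaction]}


def add_interaction(store: InMemoryStore, doc_id: str, interaction: dict) -> None:
    """Try an update first; fall back to create when the document is missing."""
    try:
        store.update(doc_id, interaction)
    except DocumentMissing:
        store.create(doc_id, interaction)


store = InMemoryStore()
add_interaction(store, "doc-1", {"content": "tree"})   # document missing: created
add_interaction(store, "doc-1", {"content": "house"})  # document exists: updated
```

With this flow every interaction after the first goes through the update path, so one document keeps growing.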

Are there ways of speeding this process up?

With kind regards,
Harm 

Re: Poor performance updating

kimchy
Administrator
My guess is that it simply gets slower since you index bigger documents with more interactions. The update API still reindexes the document. You might turn things around and index an interaction as its own document.
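
A rough back-of-the-envelope sketch of why this matters, under the simplifying assumption that the cost of an update is proportional to the number of interactions already in the document. The function names, the `parent_id` field and the unit of cost are illustrative assumptions, not measurements.

```python
from typing import List


def interactions_rewritten_by_updates(n_interactions: int) -> int:
    """Each update reindexes the document with all interactions added so far."""
    return sum(range(1, n_interactions + 1))


def interactions_written_as_separate_docs(n_interactions: int) -> int:
    """Each interaction is indexed exactly once as its own small document."""
    return n_interactions


def split_into_interaction_docs(parent_id: str, interactions: List[dict]) -> List[dict]:
    """Turn one big document into one small document per interaction."""
    return [{"parent_id": parent_id, **interaction} for interaction in interactions]


n = 1000
print(interactions_rewritten_by_updates(n))      # 500500
print(interactions_written_as_separate_docs(n))  # 1000

small_docs = split_into_interaction_docs(
    "doc-1", [{"content": "tree"}, {"content": "house"}]
)
```

Under that assumption, a document that grows to 1000 interactions through updates rewrites about 500 times more data than indexing each interaction separately.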

On Tuesday, February 14, 2012 at 12:16 PM, haarts wrote: [...]

Re: Poor performance updating

Karussell
In reply to this post by haarts
You could also shard or split the index, which will improve indexing
speed, or tune Lucene options for the indexing process only (e.g.
increase the merge factor).

http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html

Have you also thought about another model? E.g. feeding interactions
instead of documents? This way you avoid updating, but it would require
more search logic.

Peter.


Re: Poor performance updating

haarts
In reply to this post by kimchy
That is what I thought as well. 
You pointed out in another reply that this parent/child functionality might be what I was looking for. I've looked into it and have one remaining question:
I want a query that searches for 'tree AND house' and returns the parent which has a child containing 'tree' and a child containing 'house'.
Based on your Gist: my Gist.

Is such a thing possible?

With kind regards,


Re: Poor performance updating

haarts
In reply to this post by Karussell
Ah. I will dig into these options. Thanks!

I considered another model as well, especially a parent/child model, so as to avoid reindexing the entire document.
But I haven't been able to get a particular kind of search working with this. Imagine a particular parent having two children. One child has the content 'tree' and the other 'house'. I require the search 'tree AND house' to return this parent. A concrete example can be found here. Is that even possible?

With kind regards,


Re: Poor performance updating

kimchy
Administrator
Yes, it will work. You will use the has_child filter / query to filter on the children and get back the parents.
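
One possible shape for such a search body, sketched as plain Python dictionaries: a `bool` query with one `has_child` clause per term, so that 'tree' and 'house' may match in different children of the same parent. The child type name `interaction` and the field name `content` are assumptions for illustration (the real names are in the Gist), the children are assumed to be indexed with a `_parent` mapping, and the exact syntax should be checked against the documentation for the version in use.

```python
import json
from typing import List


def has_child(child_type: str, field: str, term: str) -> dict:
    """One has_child clause: match parents with at least one child containing term."""
    return {"has_child": {"type": child_type, "query": {"term": {field: term}}}}


def parents_with_all_terms(child_type: str, field: str, terms: List[str]) -> dict:
    """Require every term to match in some child, not necessarily the same one."""
    return {
        "query": {
            "bool": {"must": [has_child(child_type, field, t) for t in terms]}
        }
    }


# "interaction" and "content" are hypothetical names chosen for this sketch.
body = parents_with_all_terms("interaction", "content", ["tree", "house"])
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the search request body against the parent type. Putting both terms inside a single `has_child` clause would instead require one child to contain both words.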

On Wednesday, February 15, 2012 at 4:17 PM, haarts wrote: [...]

Re: Poor performance updating

haarts
Excellent! We'll set to work. Thank you very much for your help.