Large index design question

Sky Stebnicki
Hi all,

We've been testing elasticsearch with our application and are really enjoying its impressive performance and rock-solid stability. I have a design question about the most efficient way to index (1) a large set of highly active data and (2) an even larger set of archived data. Basically, we have tens of millions of documents, but about 80% of them are in an archived state, while 10-20% are read or updated 95% of the time.

My question is this: would it be more efficient to store the archived documents in a separate type like "/index/mydata_arch", or to just flag the archived documents as we index them and use a filtered query to cache the results?

We are working on setting up benchmarks to test this ourselves in a real-world environment, but I wanted to ask the experts here too and see if you had any input.

Thanks so much for your help!

Sky
Re: Large index design question

Berkay Mollamustafaoglu-2
I'd consider using separate indices and aliases. Keeping the active index smaller would help with performance. Will you know before you index which docs are archived and which are active?

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


On Tue, Apr 24, 2012 at 10:21 AM, Sky Stebnicki <[hidden email]> wrote:

Re: Large index design question

Sky Stebnicki
Sorry for my late reply; for some reason I was not notified when you posted. Documents can move from active to archived at any time, so we will know where a document belongs as it is indexed or updated. In our database we split the records into two separate tables for performance reasons.
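
As a rough illustration of that routing decision, here is a minimal sketch in Python. The index names, the `archived` field, and both helper functions are hypothetical; nothing here is an elasticsearch API. It only plans which operations a write would need, including a delete from the old index when a document changes state.

```python
# Hypothetical index names; the thread does not name them.
ACTIVE_INDEX = "mydata_active"
ARCHIVE_INDEX = "mydata_archive"


def target_index(doc):
    """Pick the index for a document from its (assumed) 'archived' flag."""
    return ARCHIVE_INDEX if doc.get("archived") else ACTIVE_INDEX


def plan_write(doc_id, doc, previous_index=None):
    """Return the operations needed to store doc in the right index.

    If the document previously lived in the other index, a delete is
    planned there too, so the document does not exist in both places.
    """
    index = target_index(doc)
    ops = [("index", index, doc_id)]
    if previous_index is not None and previous_index != index:
        ops.append(("delete", previous_index, doc_id))
    return ops
```

For example, `plan_write("42", {"archived": True}, previous_index="mydata_active")` plans an index operation on the archive index followed by a delete on the active index.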

So your suggestion is to create a completely separate index or type? Sorry, I'm fairly new to the ElasticSearch terminology...

Thanks,

Sky

On Tuesday, April 24, 2012 8:40:37 AM UTC-7, Berkay Mollamustafaoglu wrote:

Re: Large index design question

Berkay Mollamustafaoglu-2
No worries. Yes, I'd recommend two separate indices, not types. This would allow you to optimize them differently. There are a number of parameters that you can only set at the index level.
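
To make the idea of per-index settings concrete, here is a small sketch in Python that only builds the create-index requests and does not send them. The index names and the specific values are assumptions for illustration, not tuning advice; `number_of_shards`, `number_of_replicas` and `refresh_interval` are examples of settings that apply per index.

```python
import json


def create_index_request(index_name, shards, replicas, refresh_interval):
    """Build (method, path, body) for creating an index with its own settings."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
                "refresh_interval": refresh_interval,
            }
        }
    }
    return "PUT", "/" + index_name, json.dumps(body)


# Hypothetical names and values: the small, busy index refreshes often,
# while the large, rarely touched archive refreshes less frequently.
active = create_index_request("mydata_active", shards=2, replicas=1, refresh_interval="1s")
archive = create_index_request("mydata_archive", shards=5, replicas=1, refresh_interval="30s")
```

With types inside a single index, both sets of documents would share one group of these settings.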

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


On Thu, Apr 26, 2012 at 1:19 PM, Sky Stebnicki <[hidden email]> wrote:

Re: Large index design question

Sky Stebnicki
What did you mean by "aliases"? If we use two separate indexes, would we have to merge/union the results at the application level whenever we needed to search over both active and archived data?

On Thursday, April 26, 2012 10:57:50 AM UTC-7, Berkay Mollamustafaoglu wrote:

Re: Large index design question

Berkay Mollamustafaoglu-2

You don't have to merge the results yourself. ES allows you to create an alias that references multiple indices and then query the alias like a normal index. It'll do the work for you. Aliases are quite powerful and useful. Details here: http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html
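
A minimal sketch of what that could look like, written in Python and building the requests without sending them. The alias name `mydata_all` and the two index names are hypothetical; the request shapes follow the aliases API linked above (an `actions` list of `add` entries) and a plain `query_string` search, but check them against the documentation for your version.

```python
import json


def add_alias_request(alias, indices):
    """Build (method, path, body) that points one alias at several indices."""
    actions = [{"add": {"index": name, "alias": alias}} for name in indices]
    return "POST", "/_aliases", json.dumps({"actions": actions})


def search_request(target, query_text):
    """Build a search request; target may be an index name or an alias."""
    body = {"query": {"query_string": {"query": query_text}}}
    return "POST", "/" + target + "/_search", json.dumps(body)


# One alias over both indices (names are assumptions for illustration).
alias_req = add_alias_request("mydata_all", ["mydata_active", "mydata_archive"])

# Searching the alias uses the same path shape as searching a single index.
everything = search_request("mydata_all", "invoice")
active_only = search_request("mydata_active", "invoice")
```

The application can keep querying only the active index for the common case and use the alias when a search has to cover archived data as well.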

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


On Thu, Apr 26, 2012 at 7:46 PM, Sky Stebnicki <[hidden email]> wrote:

Re: Large index design question

Sky Stebnicki
That is really awesome! Thanks so much for your help.

On Thursday, April 26, 2012 7:28:46 PM UTC-7, Berkay Mollamustafaoglu wrote:
