Over-allocation of shards

Over-allocation of shards

Otis Gospodnetic
Hi,

Shay mentioned/suggested over-allocation of shards on a few occasions
and I wanted to make sure I understand why.

Is this primarily to accommodate future cluster growth, so that new
nodes can be added and existing shards re-distributed / re-balanced
over all old+new nodes?

Or is there some other or additional reason for this?

Thanks,
Otis
--
Sematext is Hiring World-Wide -- http://sematext.com/about/jobs.html

Re: Over-allocation of shards

kimchy
Administrator
Yes, over-allocation of shards is there to accommodate cluster growth in certain "data flows". For example, when you create an index with elasticsearch, it gets 5 shards by default, even if you use just one node. This allows the index to grow, in terms of size, to up to 5 nodes (assuming no replicas and no other indices, so you end up with 1 shard per node). Read performance can be increased by adding replicas (and nodes).
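As a rough illustration of the arithmetic above, here is a minimal Python sketch. The helper names (index_settings, max_useful_nodes) are hypothetical; the only Elasticsearch-specific parts are the number_of_shards and number_of_replicas settings sent when creating an index. It assumes at most one shard copy per node.

```python
import json


def index_settings(shards: int, replicas: int = 0) -> dict:
    """Build the request body for creating an index with an explicit shard count."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def max_useful_nodes(shards: int, replicas: int = 0) -> int:
    """Number of nodes the index can spread over with one shard copy per node."""
    return shards * (1 + replicas)


# Body that would be sent with PUT /<index-name> (no request is made here).
print(json.dumps(index_settings(shards=5)))

# 5 primaries and no replicas: the index can grow to 5 nodes.
print(max_useful_nodes(5))
# Adding 1 replica doubles the number of shard copies that can be placed.
print(max_useful_nodes(5, replicas=1))
```

The shard count is fixed when the index is created, which is why it has to be chosen with future growth in mind, while the replica count can be raised later.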

Going back to "data flow": this is one of the most important concepts to understand about your system when you use elasticsearch. Time-based data (tweets, logs) can be indexed using rolling time-based indices, and then the number of shards per index is simply there to accommodate the expected size for the time frame that index is responsible for. Aliases can then be used to point to the "latest" index to index into, or time-based aliases can be created that point to several indices to create "bigger" time frames without needing to specify all the indices that compose that time frame.
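A minimal sketch of the rolling-index idea, assuming monthly indices. The naming scheme ("tweets-2012-01"), the alias names, and the helper functions are made up for illustration; the dictionaries follow the "actions" format accepted by the _aliases endpoint.

```python
from datetime import date
from typing import List, Optional


def monthly_index(prefix: str, day: date) -> str:
    """Name of the index responsible for the month containing `day`."""
    return f"{prefix}-{day:%Y-%m}"


def point_alias(alias: str, old_index: Optional[str], new_index: str) -> dict:
    """Alias actions that move a 'latest' alias from the old index to the new one."""
    actions = []
    if old_index is not None:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


def span_alias(alias: str, indices: List[str]) -> dict:
    """Alias actions that make one alias cover several time-based indices."""
    return {"actions": [{"add": {"index": name, "alias": alias}} for name in indices]}


january = monthly_index("tweets", date(2012, 1, 20))
print(january)
print(point_alias("tweets-latest", "tweets-2011-12", january))
print(span_alias("tweets-last-quarter", ["tweets-2011-11", "tweets-2011-12", january]))
```

Each dictionary would be sent as the body of POST /_aliases; nothing is sent in this sketch.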

A "user"-based data flow is, in theory, a perfect fit for an index per user. If you have enough nodes in the cluster to do that (each shard is a Lucene index, which has a cost), that's great, and several very large scale ES users actually do that. But if you have a small cluster (or are constrained budget- or hardware-wise), you could say: OK, I will put everything in a single index. That index will then need quite a few shards to support potential growth. Without doing anything else this can be problematic, because if you create an index with 30 shards on a single-node cluster, each search request (for that user's data) will need to span all 30 shards.

This is where routing comes into play. You can have all the data for a user allocated to a specific shard using routing (with clever use of routing, you can actually have a user's data span several shards, for example by including a time-based component in the user routing value).

This means that when you search for that user's data and provide the routing value, the search will only go to a single shard. You can create an index with 100 shards, and you won't incur the overhead of searching across all shards for a specific user's data. Obviously, when you do that, you also need to filter each query so that it only returns that specific user's data. Remember also that you can still search across several users' data by simply providing several routing values.
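To make the routing idea concrete, here is a small simulation in Python. It is only a sketch: the hash used (zlib.crc32) is a stand-in chosen for illustration and is not the hash function Elasticsearch uses internally, and the helper names are hypothetical. The point is that a routing value maps deterministically to one shard, so a routed search touches only the shards its routing values map to.

```python
import zlib
from typing import Iterable, List


def shard_for(routing: str, num_shards: int) -> int:
    """Map a routing value to a shard number (stand-in hash, not ES's own)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def shards_to_search(routing_values: Iterable[str], num_shards: int) -> List[int]:
    """Shards a search has to visit for the given routing values."""
    return sorted({shard_for(value, num_shards) for value in routing_values})


def search_path(index: str, routing_values: List[str]) -> str:
    """Search URL path carrying the routing values as a query parameter."""
    return f"/{index}/_search?routing={','.join(routing_values)}"


# One user: a single shard out of 100 is searched.
print(shards_to_search(["user1"], 100))
# Two users: at most two shards are searched.
print(shards_to_search(["user1", "user2"], 100))
print(search_path("users", ["user1", "user2"]))
```

The same routing value has to be supplied when indexing the user's documents, and the query still needs a filter on the user, because other users may hash to the same shard.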

For this use case, aliases really shine. You can have a single index, let's call it users, with 50 shards, and define an alias for each user. Let's say the first user is user1: you define an alias called user1. That alias can have a routing value of "user1", which means that when you index and search against that alias (as if it were an "index", i.e. /user1/_search), the routing value will be applied automatically. But you still need to filter to only see that user's data. For that, an alias can also be associated with a filter (e.g. a term filter on the user name field with the user's value). Then, every time you search against /user1/_search, the filter will be applied automatically.
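A sketch of what such an alias definition could look like, built as a plain Python dictionary in the "actions" format of the _aliases endpoint. The field name "user" and the helper user_alias_action are assumptions made for this example; the thread does not say what the user name field is called.

```python
import json


def user_alias_action(index: str, user: str, field: str = "user") -> dict:
    """Alias action giving one user a routed, filtered view of a shared index.

    `field` is the document field holding the user name (assumed to be "user").
    """
    return {
        "add": {
            "index": index,
            "alias": user,
            "routing": user,                    # applied when indexing and searching
            "filter": {"term": {field: user}},  # hides other users on the same shard
        }
    }


# Body that would be sent with POST /_aliases (no request is made here).
body = {"actions": [user_alias_action("users", "user1")]}
print(json.dumps(body, indent=2))
```

With this alias in place, a client can index into and search against /user1 without knowing about the routing value or the filter.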

One nice aspect of this is that you can also search across several aliases, for example /user1,user2/_search (just as you can search over multiple indices), and it will "do the right thing": it will run a search with two routing values and an "OR'ed" filter.
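The "right thing" described here can be modelled conceptually as follows. This is not Elasticsearch code: the ALIASES table, the resolve_aliases helper, and the "any_of" key are all hypothetical and only show how two alias definitions combine into two routing values plus one OR'ed filter.

```python
# Hypothetical alias table: each alias carries its own routing value and filter.
ALIASES = {
    "user1": {"routing": "user1", "filter": {"term": {"user": "user1"}}},
    "user2": {"routing": "user2", "filter": {"term": {"user": "user2"}}},
}


def resolve_aliases(path_names: str, aliases: dict) -> dict:
    """Combine the aliases named in a path such as 'user1,user2'.

    Returns every routing value and a single filter that matches when any of
    the per-alias filters matches ("any_of" is illustrative notation only).
    """
    names = path_names.split(",")
    return {
        "routing": [aliases[name]["routing"] for name in names],
        "filter": {"any_of": [aliases[name]["filter"] for name in names]},
    }


print(resolve_aliases("user1,user2", ALIASES))
```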

-shay.banon


Re: Over-allocation of shards

Eric Jain
On Jan 19, 3:02 pm, Shay Banon <[hidden email]> wrote:
> Yes, overallocation of shards is to accommodate cluster growth in certain
> "data flows". For example, by default, when you create an index with
> elasticsearch, it has 5 shards by default, even if you use just one node.
> This allows the index to grow up to 5 nodes (assuming no replicas and no
> other indices, so you have 1 shard per node) in terms of size (read perf
> can be increased by increasing replicas (and nodes)).

I've wondered about the default setting of 5 shards as well. If I know
how much data I'm indexing (e.g. indexing an archive of tweets), I'll
want to change that setting. If I don't (e.g. indexing tweets in
"realtime"), I want to start with 1 shard (good enough for most
users), and re-shard to more than that only when the number of
documents passes a threshold. This requires rebuilding the index;
would be nice if elasticsearch had a simple command to do that.

Re: Over-allocation of shards

Otis Gospodnetic
In reply to this post by kimchy
Thank you, Shay, for the thorough answer. We've been making use of
exactly this index alias/sharding/routing functionality in a number
of projects where we've used ElasticSearch. As a matter of fact, this is
from 4 hours ago: http://twitter.com/#!/otisg/status/160128277497913345

And thanks for confirming the over-allocation bit!

Otis

Re: Over-allocation of shards

Karussell
In reply to this post by Eric Jain
Hi Eric

> I've wondered about the default setting of 5 shards as well. If I know
> how much data I'm indexing (e.g. indexing an archive of tweets), I'll
> want to change that setting.

You can specify that when creating the index

> If I don't (e.g. indexing tweets in
> "realtime"), I want to start with 1 shard (good enough for most
> users), and re-shard to more than that only when the number of
> documents passes a threshold. This requires rebuilding the index;
> would be nice if elasticsearch had a simple command to do that.

You can use several indices and create a new index once the number of
documents passes a threshold (plus use the index alias feature).
No need to recreate shards.
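A minimal sketch of that approach, assuming numbered indices ("tweets-1", "tweets-2", ...) and a write alias. The naming scheme, the alias name, and the helper rollover_if_needed are made up for illustration; how the document count is obtained is left out.

```python
from typing import Optional, Tuple


def rollover_if_needed(
    current_index: str, doc_count: int, threshold: int, alias: str
) -> Tuple[str, Optional[dict]]:
    """Decide whether to start a new index once the threshold is reached.

    Returns the index to write to and, when rolling over, the alias actions
    that move the write alias. The new index must be created before the
    actions are applied.
    """
    if doc_count < threshold:
        return current_index, None
    prefix, _, sequence = current_index.rpartition("-")
    new_index = f"{prefix}-{int(sequence) + 1}"
    actions = {
        "actions": [
            {"remove": {"index": current_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    return new_index, actions


print(rollover_if_needed("tweets-1", 500, 1000, "tweets-write"))
print(rollover_if_needed("tweets-1", 1000, 1000, "tweets-write"))
```

Searches can go through a second alias that covers all the numbered indices, so readers never need to know how many indices exist.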

Peter.

Re: Over-allocation of shards

Eric Jain
On Fri, Jan 20, 2012 at 03:00, Karussell <[hidden email]> wrote:
>> I've wondered about the default setting of 5 shards as well. If I know
>> how much data I'm indexing (e.g. indexing an archive of tweets), I'll
>> want to change that setting.
>
> You can specify that when creating the index

That's what I do; just pointing out that I never had a case where I
wanted to start an index with precisely 5 shards.

I'm assuming that queries on an index with 5 shards running on a
single machine are going to be slower and use more resources than
queries on an index with a single shard, right?


> you can use several indices and create a new index if  documents
> passes a threshold (+ use the index alias feature) ...
> No need to recreate shards ...

Hadn't considered that approach! I imagine that internally, shards are
similar to aliased indexes? So if there was a simple way to replace
"partition by hash" with "partition by row count" (and have shards
created lazily)...

Re: Over-allocation of shards

Karussell
> on a single machine are going to be slower

yes, of course.

> Hadn't considered that approach! I imagine that internally, shards are
> similar to aliased indexes? So if there was a simple way to replace
> "partition by hash" with "partition by row count" (and have shards
> created lazily)...

There is a similar feature (not really what you are after) via
routing. But creating 'shards' lazily can only be done by creating
'indices', which are under your control.

Peter.

Re: Over-allocation of shards

chenry12
Hi,

When you use this method (1 index + routing), it is not possible to apply an analyzer per alias, as we can when working with an index, is it?

I have another question. I listened to Shay Banon's presentation:
http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html

I don't exactly understand the difference between building 1 shard per index and this method of building 1 index + routing.
I understood that it's better for system resources, but not exactly why.

Thanks in advance,
Christophe.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Over-allocation of shards

chenry12
OK, I understand the answer to my second question now:
more than one user can share the same shard.

And what about my first question?

Thanks
