When/Why to use Routing for indexing/searching

When/Why to use Routing for indexing/searching

Paul Smith
I'm intrigued by the routing property that can be used during indexing/searching.  The documentation sort of explains how to use it, but I feel like it's missing some recommendations on _why_ one should use it: under what conditions is it a good idea to start using this feature, and when isn't it such a good idea?  It doesn't appear to be strictly necessary to specify the routing parameter at all, so presumably it's there for a good reason.

Obviously parent/child requires the child to be hosted on the same shard as the parent, so that makes sense, but I get a sense there's also an optimization for searches here: avoiding hitting shards that don't have any results for that parameter is another good use case for routing.

I'm wondering whether, for our use case, we should be routing items by their 'projectid', which in our domain is the central 'root' concern of all data elements: pretty much everything belongs to a project. So when indexing, should I be using a routing value based on the projectID so that all project-related information is nicely co-located?  Generally people search on a project basis, but sometimes they want to search across multiple projects, so we'd need to be able to spread that search across projects.

Does routing, however, create 'hot' shards that get hit more than others? Does that matter anyway, with replicas to distribute the load?  I see from the recent yFrog post that they use routing (see [1] if you missed that, thanks for that great post!).  As I understand that post, all of a user's data is forced into the same shard, but it's not clear to me what benefit that has in the search case.

Can anyone else explain why they use routing for their domain?

thanks!

Paul Smith

Re: When/Why to use Routing for indexing/searching

Clinton Gormley-2
Hiya Paul

On Thu, 2012-03-01 at 16:51 +1100, Paul Smith wrote:
> I'm intrigued by the routing property that can be used during
> indexing/searching.  

Me too.  I'm in the process of writing a framework to use ES as my sole
data store. Although I'm not using routing myself yet, the framework has
to support it.

> The documentation sort of explains how to use it, but I feel like it's
> missing some recommendations on _why_ one should use it; under what
> conditions is it a good idea to start using this feature and when
> using it isn't such a good idea.  It doesn't appear to really be
> necessary to specify the routing parameter at all, just there for a
> good reason.

The main use I can see is if you have a large dataset (ie you need lots
of shards) which is easily sub-divided.

For instance, we have a single application which runs multiple
white-label sites for many different clients. While we occasionally need
to search across all clients, the web app only ever needs to query one
client at a time.  So instead of hitting 10 shards, we could just hit
one, if we use the client ID for routing.
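A minimal sketch of the "one shard instead of 10" idea, assuming the shard is picked by hashing the routing value modulo the shard count. `zlib.crc32` is only a stand-in for whatever hash Elasticsearch uses internally, and the client IDs are made up.

```python
import zlib

NUM_SHARDS = 10  # matches the "10 shards" in the example above


def shard_for(routing: str, num_shards: int = NUM_SHARDS) -> int:
    """Stand-in routing function: hash the value, take it modulo the shard count."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def shards_to_query(routing_values=None, num_shards: int = NUM_SHARDS) -> set:
    """Shards a search must visit: all of them without routing, else only the routed ones."""
    if not routing_values:
        return set(range(num_shards))
    return {shard_for(r, num_shards) for r in routing_values}


if __name__ == "__main__":
    print(len(shards_to_query()))                # broadcast search: every shard
    print(len(shards_to_query(["client-42"])))   # one client: a single shard
    print(shards_to_query(["client-42", "client-7"]))  # two clients: at most two shards
```

A search across all clients simply omits the routing value and falls back to the broadcast case.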

I suppose you could also say: use the user_id for routing for anything
that belongs to the user (even if there isn't an enforced parent-child
relationship).

This does introduce a complication though.  A unique ID for a document
in a cluster actually consists of Index, Type, ID and Routing.  It is
quite possible to have two docs with the same Index, Type and ID if you
specify different Routing values (although this would be unwise, as your
Routing value may end up being hashed to the same shard without you
realising it).
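The complication can be reproduced with a toy in-memory model, sketched here under the assumption that uniqueness of (type, ID) is only enforced inside a single shard. The shard count, the `crc32` stand-in hash, and all names are hypothetical.

```python
import itertools
import zlib

NUM_SHARDS = 5  # hypothetical shard count


def shard_for(routing: str) -> int:
    """Stand-in routing hash, not Elasticsearch's real one."""
    return zlib.crc32(routing.encode("utf-8")) % NUM_SHARDS


# One dict per shard, keyed by (type, id): a shard only knows about its own documents.
shards = [dict() for _ in range(NUM_SHARDS)]


def index_doc(doc_type, doc_id, body, routing=None):
    """Store a document on the shard chosen by its routing value (the ID when none is given)."""
    target = shard_for(routing if routing is not None else doc_id)
    shards[target][(doc_type, doc_id)] = body
    return target


def copies_of(doc_type, doc_id):
    """Count how many shards hold a document with this type and ID."""
    return sum((doc_type, doc_id) in shard for shard in shards)


first = index_doc("item", "1", {"v": "a"}, routing="client-a")
# Pick a routing value that happens to hash to a different shard.
other = next(r for r in (f"client-{i}" for i in itertools.count()) if shard_for(r) != first)
index_doc("item", "1", {"v": "b"}, routing=other)
print(copies_of("item", "1"))  # 2: the "same" document now exists twice
```

Had the second routing value hashed to the same shard as the first, the second write would have silently overwritten the first instead.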

>
> Does routing however create 'hot' shards that get hit more than
> others? Does that matter anyway with replicas to distribute the load?

Potentially, yes.  It is not obvious which shard your routing value would
point to.  So (eg in my route-by-client-id example) we could end up with
our two biggest clients on the same shard, and our two smallest on the
same shard.  
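To see how uneven things could get before committing to a routing key, one could tally documents per shard for a known set of clients. This is only a sketch: the client sizes are invented, and `crc32` stands in for the real routing hash, so the actual placement in a cluster would differ.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(routing: str) -> int:
    """Stand-in routing hash, not Elasticsearch's real one."""
    return zlib.crc32(routing.encode("utf-8")) % NUM_SHARDS


def docs_per_shard(docs_per_client: dict) -> Counter:
    """Add each client's document count to the shard its routing value maps to."""
    load = Counter({shard: 0 for shard in range(NUM_SHARDS)})
    for client_id, doc_count in docs_per_client.items():
        load[shard_for(client_id)] += doc_count
    return load


# Made-up client sizes: two large tenants and a few small ones.
clients = {
    "big-1": 5_000_000,
    "big-2": 4_000_000,
    "small-1": 10_000,
    "small-2": 8_000,
    "small-3": 500,
}
load = docs_per_shard(clients)
for shard, count in sorted(load.items()):
    print(f"shard {shard}: {count:>9,} docs")
```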

>   I see from the recent yFrog post they use routing (see [1] if you
> missed that, thanks for that great post!).  All users data is forced
> into that same shard as I understand reading that post, but it's not
> clear to me what benefit that has in the search case.

I'd be very interested in hearing how people are using routing.  How
'dynamic' is the routing value?
>
clint



Re: When/Why to use Routing for indexing/searching

kimchy
Administrator
In reply to this post by Paul Smith
Heya, yea, what you mention, having a projectId as the routing value, can be a nice optimization: when you search on a project, you can run the search on a single shard instead of broadcasting it across all shards. This allows you to have a considerably higher number of shards on the "products" index. I go into detail about it here: https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/data$20flow/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ.
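As a rough illustration of what this looks like on the wire, the sketch below only builds request URLs with the `routing` query parameter and sends nothing. The host, the index name `projects`, the type `item`, and the project IDs are assumptions; the comma-separated form is how a search can be pointed at several routing values at once.

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:9200"  # assumed local node


def index_url(index, doc_type, doc_id, routing):
    """URL for indexing one document with an explicit routing value."""
    path = f"{BASE}/{quote(index)}/{quote(doc_type)}/{quote(doc_id)}"
    return f"{path}?{urlencode({'routing': routing})}"


def search_url(index, routing_values=()):
    """Search URL; with no routing values the search is broadcast to every shard."""
    url = f"{BASE}/{quote(index)}/_search"
    if routing_values:
        url += "?" + urlencode({"routing": ",".join(routing_values)})
    return url


print(index_url("projects", "item", "1", "project123"))
print(search_url("projects", ["project123"]))                # single project
print(search_url("projects", ["project123", "project456"]))  # a few projects
print(search_url("projects"))                                # across all projects
```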


Re: When/Why to use Routing for indexing/searching

Michael Sick
Shay,

Will a single value for a routing id (say in this case "project123") always resolve to a single shard or will ES manage multiple shards per routing value if we exceed a size threshold?

--Mike


Re: When/Why to use Routing for indexing/searching

Frederic
In reply to this post by Clinton Gormley-2
We have a scenario where we use routing in a production system at my company (the main ecommerce site for LATAM).

The site manages millions of users selling millions of items, and each user needs to manage their own items in a 'My Account' site. For that system, we created an ES index for all items and search it using the seller ID as a filter.

At indexing time we route documents using the seller ID, so that items from the same seller go to the same shard, and thus we balance the required space among servers (we can consider the distribution of items per user to be normal). To improve search times, as Shay said, we route user searches using the same seller ID so that they hit the right servers and not all of them.

So, although of course some sellers have far more items than the rest, and that may slightly overload some servers, the number of users is big enough to keep the load well distributed under this scheme. On the other hand, a possible alternative to this approach would be to have one index per user, but we weren't sure how ES would behave with millions of indices, and we preferred to have all items under the same "umbrella" as it simplifies administration tasks when we need to search for items across all users.
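The combination of routing and a seller-ID filter can be sketched with a toy in-memory model. It assumes a stand-in `crc32` hash and invented seller IDs; the point is only that a routed search still needs the filter, because one shard holds items from many sellers.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count
shards = [[] for _ in range(NUM_SHARDS)]


def shard_for(routing: str) -> int:
    """Stand-in routing hash, not Elasticsearch's real one."""
    return zlib.crc32(routing.encode("utf-8")) % NUM_SHARDS


def index_item(seller_id: str, title: str) -> None:
    """Route the item by seller ID so all of a seller's items share a shard."""
    shards[shard_for(seller_id)].append({"seller_id": seller_id, "title": title})


def search_seller(seller_id: str) -> list:
    """Visit only the seller's shard, then filter out the other sellers stored there."""
    shard = shards[shard_for(seller_id)]
    return [doc for doc in shard if doc["seller_id"] == seller_id]


for n in range(20):
    index_item(f"seller-{n}", f"item from seller {n}")
index_item("seller-3", "second item from seller 3")

print([doc["title"] for doc in search_seller("seller-3")])
# ['item from seller 3', 'second item from seller 3']
```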

Hope it helps! Cheers,
Frederic


Re: When/Why to use Routing for indexing/searching

kimchy
Administrator
In reply to this post by Michael Sick
It will always resolve to a single shard.
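A small sketch of why that holds, assuming the shard is computed as a hash of the routing value modulo the number of primary shards, with the document ID used when no routing value is given. `crc32` is a stand-in hash and the shard count of 5 is arbitrary.

```python
import zlib


def shard_for(routing: str, num_shards: int) -> int:
    """Stand-in routing hash, not Elasticsearch's real one."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def resolve(doc_id: str, routing=None, num_shards: int = 5) -> int:
    """Use the explicit routing value when given, otherwise fall back to the document ID."""
    return shard_for(routing if routing is not None else doc_id, num_shards)


# Every document routed with "project123" lands on the same shard, whatever its ID.
targets = {resolve(str(doc_id), routing="project123") for doc_id in range(1000)}
print(len(targets))  # 1

# Without routing, the IDs themselves are hashed, so documents spread out.
spread = {resolve(str(doc_id)) for doc_id in range(1000)}
print(sorted(spread))
```

Because the shard is computed rather than looked up, nothing in this model grows with the number of distinct routing values, and a routing value never splits across shards no matter how many documents it carries.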


Re: When/Why to use Routing for indexing/searching

Paul Smith
In reply to this post by kimchy


On 1 March 2012 23:40, Shay Banon <[hidden email]> wrote:
Heya, yea, what you mention, having a projectId as the routing can be a nice optimization, since then when you search on a project, you can just do the search on a single shard instead of broadcast across shards. This allows to have a considerably higher number of shards on the "products" index. I go into detail about it here: https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/data$20flow/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ.



That's a great link Shay, thanks very much.  Hitting a single shard and not needing to merge the per-shard result sets is a nice local optimization, particularly when the size of the result set may be large (the pathological case of needing 'all' results sorted is exacerbated when split across shards, as I understand it).
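The merge cost can be sketched as a scatter-gather step: to build one page, each shard has to hand over its own top `offset + size` hits before the coordinator can cut the page. The scores below are made-up integers and the function names are hypothetical.

```python
import heapq
import itertools


def shard_top(hits, k):
    """What one shard returns: its own k best hits, sorted by descending score."""
    return sorted(hits, reverse=True)[:k]


def merged_page(per_shard_hits, offset, size):
    """Coordinator step: merge the per-shard lists, then cut out the requested page.

    Returns the page and the number of hits the shards had to send.
    """
    k = offset + size
    partials = [shard_top(hits, k) for hits in per_shard_hits]
    merged = heapq.merge(*partials, reverse=True)
    page = list(itertools.islice(merged, offset, k))
    return page, sum(len(p) for p in partials)


three_shards = [[9, 4, 1], [8, 7, 2], [6, 5, 3]]
one_shard = [[9, 8, 7, 6, 5, 4, 3, 2, 1]]

print(merged_page(three_shards, 0, 3))  # ([9, 8, 7], 9)
print(merged_page(one_shard, 0, 3))     # ([9, 8, 7], 3)
```

Both calls return the same page, but the three-shard search moves three times as many hits to get it, and the gap widens as the requested window grows.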

I was thinking about our project distribution: some projects are huge, with many tens of millions of items, whereas some are quite small. I wondered about the problem of 'hotness' of a specific shard, but I think it ends up being better when a search only queries a single shard, and the hotness is good for the filesystem cache anyway.  Replicas are then the way to distribute the read load of searches.

thanks again,

Paul

Re: When/Why to use Routing for indexing/searching

kimchy
Administrator
Usually, for people who use the routing feature and are concerned with the "hotness" of a specific user/project, the answer is that if a specific project/user becomes really big, you can always move it to its own index. Aliases allow you to do this without affecting the client code: instead of having an alias with a routing value and filter pointing to your multi-tenant index, you repoint it to an index that is associated only with the large project in question.
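A sketch of what the two alias states might look like as request bodies for the aliases endpoint. The index names `projects_shared` and `project123_dedicated`, the alias name, and the `project_id` field are all hypothetical; the code only builds and prints the JSON and does not talk to a cluster.

```python
import json


def tenant_alias(alias, index, routing=None, project_id=None):
    """Build one "add" action; routing and filter are only needed on a shared index."""
    action = {"index": index, "alias": alias}
    if routing is not None:
        action["routing"] = routing
    if project_id is not None:
        action["filter"] = {"term": {"project_id": project_id}}
    return {"add": action}


# Before: the alias points at the shared index, pinned to one shard and filtered to one project.
shared = {
    "actions": [
        tenant_alias("project123", "projects_shared",
                     routing="project123", project_id="project123"),
    ]
}

# After: the alias is swapped to a dedicated index in a single request.
moved = {
    "actions": [
        {"remove": {"index": "projects_shared", "alias": "project123"}},
        tenant_alias("project123", "project123_dedicated"),
    ]
}

print(json.dumps(shared, indent=2))
print(json.dumps(moved, indent=2))
```

Client code keeps addressing `project123` throughout. The project's documents would still have to be copied into the dedicated index before the swap, which this sketch does not cover.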


Routing and Shard Size

ElanKeith
This post has NOT been accepted by the mailing list yet.
In reply to this post by kimchy
Shay indicated here that routing will always resolve to a single shard.

So, is ES maintaining an internal "index" (by "index", I mean some way of identifying the specific shard for a specific routing id)?

ES would need to somehow map a specific routing id to a specific shard, I would presume.

If so, is there an overhead to maintaining such a relationship (from a memory and insertion-time standpoint)?  For example, what is the overhead when ES processes a document that comes in without a routing value versus one that comes in needing a specific routing?

 
"http://elasticsearch-users.115913.n3.nabble.com/When-Why-to-use-Routing-for-indexing-searching-td3789570.html#a3790713"

Thanks in advance for responses,
Elan.