Large SOLR migration recommendations (3.6 Billion+ docs)


Large SOLR migration recommendations (3.6 Billion+ docs)

ocryan-2
We are considering migrating from Solr to ES and are planning our server/shard setup now. I thought that perhaps the group has some recommendations based on our current Solr environment. Currently we have:

4x dedicated Solr machines, each with an 8-core CPU, 96 GB RAM, and 8x500 GB SSDs in RAID 5
Each Solr server hosts 10 cores (indexes), and each of those 10 cores is sharded across the 4 servers
Each individual core currently occupies about 200 GB of disk space on average and holds approximately 90 million documents.

This brings our total index space to 8TB and 3.6 billion documents. We index about 10 million new documents each day.

So my question is: when designing a new ES deployment architecture, what would be the suggested number of servers and number of shards? This is a multi-tenant environment, so is it better to separate the tenants (2,000+) by index or by document type? Also, the Solr index currently does NOT store the document content, but we would like ES to, so our index size will increase substantially.

Thank you in advance for your time.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Large SOLR migration recommendations (3.6 Billion+ docs)

Karel Minařík
> So my question is, if designing a new ES deployment architecture what would be the suggested number of servers and number of shards?

I'm afraid nobody can give you specific numbers with real confidence -- the infrastructure depends not only on the number of documents or the index size, but also on the query types (sorting, how much? faceting, how much? heavy queries?), query throughput, your desired response latency, etc.

The usual advice to get some feeling and understanding of the requirements is:

1/ Create a shielded, controlled, limited environment (CPU-wise, IO-wise, RAM-wise), either a VM or a physical machine
2/ Create an index with *one* single shard, and pour documents based on your real data into it
3/ Fire some example queries based on your projected usage
4/ Continue pouring documents into this index while performing the queries, and *stop* when you feel or know it's too much for the single shard on this infrastructure to handle.

Now you have an approximate number of docs / data size a single shard can handle, and you can estimate how many of those shards a real machine can handle. From that point on, everything is simply a matter of multiplication :)
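For step 2, the single-shard index can be created with settings along these lines (a sketch; the index name you PUT it to is arbitrary, and replicas are disabled here since it's a capacity test):

```json
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
```

PUT this body when creating the test index, then bulk-index your real documents in batches while running your representative queries, watching latency and heap until the shard degrades.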

> This is a multi-tenant environment so is it better to separate the tenants (2,000+) by index or by document type?

If it makes sense for you to query a specific portion of the corpus, based on "account ID", definitely have a look into the "routing" and "index aliases" feature. Shay talks about some concrete strategies in this presentation: http://www.elasticsearch.org/videos/2012/06/05/big-data-search-and-analytics.html
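One common shape of that pattern, sketched here with a hypothetical tenant ID and field name, is a filtered alias per tenant that also pins the routing value, so a tenant's documents and queries land on a single shard:

```json
{
  "actions": [
    {
      "add": {
        "index": "shared_index",
        "alias": "tenant_42",
        "routing": "42",
        "filter": { "term": { "tenant_id": "42" } }
      }
    }
  ]
}
```

POSTed to the `_aliases` endpoint, this lets the application index and search against `tenant_42` as if it were a dedicated index, while everything lives in one shared index underneath.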

> Also, the solr index is currently NOT storing the document content, but we would like ES to so our index size will increase substantially.

To save disk space when you'd like to store and serve the documents themselves, consider using the `compress` option (http://www.elasticsearch.org/guide/reference/mapping/source-field.html)
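In mapping form that looks like the following (the type name `document` is just a placeholder for your own type):

```json
{
  "mappings": {
    "document": {
      "_source": { "compress": true }
    }
  }
}
```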

Karel

 
 

Re: Large SOLR migration recommendations (3.6 Billion+ docs)

ocryan-2
Thank you for your response.

Routing looks very good for our situation, yes. We can effectively limit each query to a single shard quite easily this way. Given this, query performance will not be directly affected by the number of shards, since no query will span multiple shards. However, there has to be some overhead for each additional shard. Is there some idea of what this overhead is? Would it be reasonable to have 100 shards? 1,000 shards?
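For reference, pinning a query to one shard is just an ordinary search with the `routing` parameter set to the tenant's value; with a hypothetical `tenant_id` field, the body sent to `/shared_index/_search?routing=42` would be:

```json
{
  "query": {
    "term": { "tenant_id": "42" }
  }
}
```

The filter is still needed so that other tenants whose routing value happens to hash to the same shard don't leak into the results.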

I set up a simple 3-node cluster and initialized a new index with 1,000 shards and 1 replica, and it didn't take long to go green.

On Tuesday, February 5, 2013 11:23:42 PM UTC-8, Karel Minařík wrote:

 
 

Re: Large SOLR migration recommendations (3.6 Billion+ docs)

Karel Minařík
> However, there has to be some overhead for each additional shard. Is there some idea of what this overhead is? Would it be reasonable to have 100 shards? 1000 shards?

That depends on how many nodes you have to put all those shards on :)

> I setup a simple 3 node cluster and initialized a new index with 1000 shards and 1 replica and it didn't take long to go green.

That's one way to stress-test it. But a more reasonable way is the "overload one shard" technique I described, because "scaling out" simply means placing all those shards (= Lucene indices) *somewhere*. Given that you're not even hitting all or many of them at once, it plays _very_ well into your cards.

Karel





Re: Large SOLR migration recommendations (3.6 Billion+ docs)

Otis Gospodnetic
In reply to this post by ocryan-2
Hi,

I would think the requirements would be roughly the same for Solr and ES.  If you are comparing and doing performance testing, hardware capacity planning, and such, you may find SPM for ES and Solr helpful - http://sematext.com/spm/index.html .
5+ years ago I ran a site that had >300K Lucene indices on a single dual-core server with 8 GB of RAM. The indices would get closed when not in use, but crawlers were allowed to crawl the site and triggered all kinds of queries, so you can imagine what was happening. So I would consider the index-per-tenant approach first. Oh, and I know one of the SPM users has a 15-node ES cluster with close to 4,000 indices, and I don't know how many shards. I don't think their servers are better than yours.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/


On Tuesday, February 5, 2013 10:26:21 AM UTC-5, ocryan wrote:

 
 

Re: Large SOLR migration recommendations (3.6 Billion+ docs)

ocryan-2
Index per tenant would be the simplest solution. That was my original thought; I was just concerned about the overhead of each additional index. Then there are other logistical concerns, such as whether the elasticsearch-head plugin will handle displaying 2,000+ indexes. I think I'll start with this approach and run some tests to find out. I'll report back here for the benefit of the group.

Thanks.

 
 

Re: Large SOLR migration recommendations (3.6 Billion+ docs)

Mike
In addition to what was mentioned above about setting `"_source" : { "compress" : true }`, you should also consider setting `"_all" : { "enabled" : false }` to save even more space.
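Put together, a mapping along these lines (the type name `document` is again just a placeholder) applies both space-saving settings:

```json
{
  "mappings": {
    "document": {
      "_source": { "compress": true },
      "_all": { "enabled": false }
    }
  }
}
```

One caveat: with `_all` disabled, queries must name explicit fields, since the catch-all field that aggregates every value is no longer indexed.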


On Wednesday, February 13, 2013 9:05:40 AM UTC-5, ocryan wrote:
