Snapshot Scaling Problems


Snapshot Scaling Problems

Andy Nemzek
Hello,

My company is using the ELK stack. Right now we have a very small amount of data actually being sent to Elasticsearch (probably a couple hundred Logstash entries a day, if that); however, the data that is getting logged is very important. I recently set up snapshots to help protect this data.

I take one snapshot a day, I delete snapshots that are older than 20 days, and each snapshot covers all the Logstash indexes in Elasticsearch. It's also a business requirement that we be able to search at least a year's worth of data, so I can't close Logstash indexes unless they're older than at least a year.

Now, we've been using Logstash for several months, and each day it creates a new index. We've found that even though there is very little data in these indexes, it's taking upwards of 30 minutes to take a snapshot of all of them, and each day it appears to take 20 to 100 seconds longer than the last. It is also taking about 30 minutes to delete a single snapshot, which we do each day as part of cleaning up old snapshots. So the whole process is taking about an hour each day and appears to be growing longer very quickly.

Am I doing something wrong here, or is this kind of thing expected? It seems pretty strange that it's taking so long with the little amount of data we have. I've looked through the snapshot docs several times and there doesn't appear to be much discussion of how the process scales.
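
For reference, the daily job boils down to two HTTP calls against the cluster; roughly the following (a minimal Python/requests sketch of the idea — the repository name, the snapshot naming scheme, and the localhost address are just illustrative):

    import datetime
    import requests

    ES = "http://localhost:9200"
    REPO = "s3_backup"   # illustrative repository name
    KEEP_DAYS = 20

    today = datetime.date.today()

    # Take today's snapshot of every logstash index and wait for it to finish.
    requests.put(
        "%s/_snapshot/%s/snapshot-%s" % (ES, REPO, today.isoformat()),
        params={"wait_for_completion": "true"},
        json={"indices": "logstash-*"},
    ).raise_for_status()

    # Delete the snapshot that has aged past the 20-day retention window.
    expired = today - datetime.timedelta(days=KEEP_DAYS)
    requests.delete("%s/_snapshot/%s/snapshot-%s" % (ES, REPO, expired.isoformat()))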

Thanks!


Re: Snapshot Scaling Problems

Mark Walkom-2
How many indices are there, and are you using the default shard count (5)? Are you optimising older indices?

The snapshot copies segments, so it may be that there are a lot of them to copy. You could try optimising your old indices, e.g. those older than 7 days, down to a single segment and then see if that helps.

Be aware, though, that optimise is a resource-heavy operation, so unless you have a lot of resources you should only run one at a time.
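
Something along these lines would do it (a rough sketch in Python with the requests library; it assumes the default logstash-YYYY.MM.DD index names, a node on localhost, and a 7-day cutoff — adjust to taste):

    import datetime
    import requests

    ES = "http://localhost:9200"
    cutoff = datetime.date.today() - datetime.timedelta(days=7)

    # Find all logstash indices, then optimise the ones older than the cutoff
    # down to a single segment, one index at a time (it is a heavy operation).
    names = requests.get("%s/logstash-*/_settings" % ES).json()
    for name in sorted(names):
        day = datetime.datetime.strptime(name, "logstash-%Y.%m.%d").date()
        if day < cutoff:
            requests.post("%s/%s/_optimize" % (ES, name),
                          params={"max_num_segments": 1}).raise_for_status()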


Re: Snapshot Scaling Problems

Andy Nemzek
Hi Mark,

Thanks for the reply.  

We've been using Logstash for several months now, and it creates a new index each day, so I imagine there are over 100 indexes at this point.

Elasticsearch is running on a single machine. I haven't done anything with shards, so the defaults must be in use. We haven't optimized old indexes; we're pretty much just running ELK out of the box.

When you mention 'optimizing indexes', does this process combine indexes? Do you know if these performance problems are typical when using ELK out of the box?




Re: Snapshot Scaling Problems

Andy Nemzek
In reply to this post by Mark Walkom-2
I forgot to mention that we're also using the S3 snapshot plugin to store the snapshots in an S3 bucket. Could this be part of the performance problem?
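
For reference, the repository was registered with the plugin more or less like this (the bucket name and region here are placeholders, not our real values):

    import requests

    # Register an S3 snapshot repository via the AWS cloud plugin.
    requests.put(
        "http://localhost:9200/_snapshot/s3_backup",
        json={
            "type": "s3",
            "settings": {
                "bucket": "my-elk-snapshots",
                "region": "us-east-1",
            },
        },
    ).raise_for_status()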



Re: Snapshot Scaling Problems

Magnus Bäck
In reply to this post by Andy Nemzek
On Monday, March 09, 2015 at 20:29 CET,
     Andy Nemzek <[hidden email]> wrote:

> We've been using logstash for several months now and it creates a new
> index each day, so I imagine there are over 100 indexes at this point.

Why create daily indexes if you only have a few hundred entries in each?
There's a constant overhead for each shard, so you don't want more
indexes than you need. It seems like you'd be fine with monthly indexes,
and then your snapshot problems would disappear too.

> Elasticsearch is running on a single machine...I haven't done anything
> with shards, so the defaults must be in use.  Haven't optimized old
> indexes.  We're pretty much just running ELK out of the box.  When you
> mention 'optimizing indexes', does this process combine indexes?

No, but it can combine segments within the Lucene indexes (the shards)
that make up an Elasticsearch index, and segments are what gets backed up.
So the more segments you have, the longer snapshots are
going to take.
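
You can see how much of this you have with the segments API; a quick sketch (Python + requests, node address assumed):

    import requests

    # Count Lucene segments per index. The snapshot has to copy each segment
    # file, so a large total here translates directly into long snapshot times.
    stats = requests.get("http://localhost:9200/_segments").json()
    for index, data in sorted(stats["indices"].items()):
        total = sum(len(copy["segments"])
                    for shard_copies in data["shards"].values()
                    for copy in shard_copies)
        print(index, total)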

> Do you know if these performance problems are typical when
> using ELK out of the box?

100 indexes on a single box should be okay but it depends on
the size of the JVM heap.

--
Magnus Bäck                | Software Engineer, Development Tools
[hidden email] | Sony Mobile Communications


Re: Snapshot Scaling Problems

Aaron Mefford
With the low volume of ingest and the long duration of history, I'd suggest trimming the number of shards per index back from the default of 5. Based on your ~100 docs per day, I'd say one shard per daily index. If you combine this with the other suggestion to increase the time period each index covers, you might increase the shard count again, but maybe still not. Running an optimize once a time period is complete is great advice if you can afford the overhead; one day at a time you should be able to, and the overhead of not optimizing is costing you more when you snapshot.

An index is made of shards, and a shard is made of Lucene segments. The segments are the actual files that get copied when you snapshot, so the total segment count is multiplied by the number of shards per index and the number of indexes. Reducing the number of indexes by covering larger time periods will significantly reduce the number of segments, as will reducing the number of shards per index. Optimizing an index also consolidates many segments into a single segment.

Based on the use of S3, should we assume you are on AWS EC2? What instance size? Your data volume seems very low, so it is concerning that snapshots take this long; it points to a slow file system or a very large number of segments (100 indexes x 5 shards per index x some number of segments per shard adds up to many thousands of segments). What does your storage look like? If you are on EC2, are you using the newer SSD-backed EBS volumes? In my experience, some of the smaller instance sizes significantly limit sustained EBS throughput.
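
For future indices the shard count can be dropped with an index template; a sketch of the idea (the template name is arbitrary, and this only affects indices created after it is in place — existing ones keep their five shards):

    import requests

    # Template applied to every new logstash-* index: one shard, no replicas
    # (replicas add nothing on a single-node cluster anyway).
    requests.put(
        "http://localhost:9200/_template/logstash_single_shard",
        json={
            "template": "logstash-*",
            "order": 1,
            "settings": {
                "index.number_of_shards": 1,
                "index.number_of_replicas": 0,
            },
        },
    ).raise_for_status()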


Re: Snapshot Scaling Problems

Andy Nemzek
Thank you guys for your thoughts here. This is really useful information. Again, we're creating daily indexes because that's what Logstash does out of the box with the elasticsearch output plugin, and this kind of tuning info isn't included with that plugin.

Minimizing both the number of indexes and the number of shards now sounds like a great idea.

We are indeed using EC2. We're just using an m1.small that's EBS-backed (non-SSD). So, yes, it's not a very powerful machine, but again, we're not throwing a lot of data at it either.



Re: Snapshot Scaling Problems

Aaron Mefford
Yes, it was on m1.smalls that I first noticed the EBS throttling. Things work well in bursts, but sustained EBS throughput does not. It will work substantially better on an m3.medium, and especially if you are using the new SSD-backed EBS volumes.


Re: Snapshot Scaling Problems

Andy Nemzek
Good to know, thanks for the tip!  

PS: Per the prior conversation, I tried optimizing all the indexes down to one segment per shard. The operation did not take long, but it appears to have had only a marginal effect on snapshot performance. I'm not sure whether it matters that the indexes in the old snapshots aren't optimized; that is, will snapshotting get faster once those old snapshots have all been deleted and replaced with snapshots of the new, optimized indexes?

I guess the other two levers to try cranking are reducing shards and reducing indexes.  If I understand correctly, there's no way to do this without writing up some script?



Re: Snapshot Scaling Problems

Aaron Mefford
I don't know precisely what snapshotting procedure you are using, but in general, yes: the old snapshots were taken before you optimized, so they likely contain many more segments than snapshots of the optimized indexes will. They should age off over time, or you can start tackling them one by one.

Regarding reducing shards and indexes: in order to merge existing indexes, yes, a script or possibly a plugin would be required; there is nothing built in today to accomplish this task. I think there are features coming in 1.5 or 2.0 that will make this easier. Honestly, though, I think your biggest issue is an undersized instance. If you need to stay at that size I can understand that, but make sure you are weighing the cost of your engineering effort against the cost of a larger instance. You could spend a lot of hours trying to make something work, with little success, that would just work on a larger instance.
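
If you do go the script route, the usual shape is a scan-and-scroll read from each daily index and a bulk write into a larger one; a rough sketch with the official elasticsearch-py client (the monthly naming scheme and the match_all copy are assumptions — verify counts before deleting anything):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk, scan

    es = Elasticsearch()  # defaults to localhost:9200

    def merge_daily_into_monthly(daily_index):
        # e.g. logstash-2015.03.01 -> logstash-2015.03 (assumed naming scheme)
        monthly_index = daily_index.rsplit(".", 1)[0]
        actions = (
            {
                "_index": monthly_index,
                "_type": hit["_type"],
                "_id": hit["_id"],
                "_source": hit["_source"],
            }
            for hit in scan(es, index=daily_index,
                            query={"query": {"match_all": {}}})
        )
        bulk(es, actions)
        # Only after verifying the copy should the daily index be removed:
        # es.indices.delete(index=daily_index)

    merge_daily_into_monthly("logstash-2015.03.01")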

It is also possible that you are exceeding the IOPS available to an m1.small or to your EBS volume. Upgrading will improve your I/O, and reducing the number of segments will reduce the need for it. You may want to consider moving to a t2.medium, which is in the same ballpark on cost but can burst to much better performance.


Re: Snapshot Scaling Problems

Andy Nemzek
I wanted to follow up here. We tried snapshotting to disk instead of using the S3 plugin. We've been running this way for the better part of a month, and the snapshot/cleanup process now takes only a couple of minutes, down from over an hour. So the problems we were experiencing were apparently due to the plugin and how it manipulates data in S3. We plan on snapshotting to disk and simply syncing that to S3 for permanent storage. Hopefully this will help others with the same problem!
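
For anyone who wants to copy the setup: the repository is a plain fs type pointing at a local path, and the S3 copy happens outside Elasticsearch entirely. Roughly (the path, repository name, and bucket are placeholders — substitute your own, and on newer releases the path also has to be whitelisted via path.repo in elasticsearch.yml):

    import subprocess
    import requests

    ES = "http://localhost:9200"

    # One-time setup: a shared-filesystem snapshot repository on local disk.
    requests.put(
        "%s/_snapshot/local_backup" % ES,
        json={"type": "fs",
              "settings": {"location": "/var/backups/elasticsearch",
                           "compress": True}},
    ).raise_for_status()

    # After each daily snapshot completes, push the repository directory to S3
    # out of band (assumes the AWS CLI is installed and configured).
    subprocess.check_call(["aws", "s3", "sync",
                           "/var/backups/elasticsearch",
                           "s3://my-elk-snapshots/elasticsearch/"])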

