How do you run ES with limited data storage space?


How do you run ES with limited data storage space?

David Reagan
So, I haven't figured out the right search terms to find the answer via Google, I've read a lot of the docs on Snapshot and Restore without finding an answer, and I haven't had the time or resources to test my own ideas. I'm posting this in the hope that someone who has already solved the problem will share.

How do you run ES with limited data storage space?

Basically, short of getting more space, what can I do to make the best use of what I have, and still meet as many of my goals as possible?

My setup is 4 data nodes. Due to lack of resources/money, they are all thin-provisioned VMs, and all my data has to be on NFS/SAN mounts. Storing data on the actual VMs' hard disks would negatively affect other VMs and services.

Our NFS SAN is also low on space, so I only have about 1.5TB to use. Initially that seemed like plenty, but a couple of weeks ago ES started complaining about running out of space. Usage on that mount was over 80%: my snapshot repository had ballooned to over 700GB, and each node's data mount point was around 150GB.

Currently, I'm only using ES for logs.

For day-to-day use, I should be fine with one month of open indices, so I've already been keeping older indices closed. I can't really do much more when it comes to closing indices.
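
A minimal sketch of what such a close-old-indices job could look like against the ES 1.x REST API (the host, port, and logstash-YYYY.MM.DD naming are assumptions):

```python
# Sketch: close Logstash indices more than 30 days old via the ES 1.x REST API.
# The host and the index naming convention are assumptions.
import datetime
import requests

ES = "http://localhost:9200"
cutoff = datetime.date.today() - datetime.timedelta(days=30)

# _cat/indices returns one plain-text line per index.
for line in requests.get(ES + "/_cat/indices").text.splitlines():
    name = next((c for c in line.split() if c.startswith("logstash-")), None)
    if name is None:
        continue
    try:
        day = datetime.datetime.strptime(name, "logstash-%Y.%m.%d").date()
    except ValueError:
        continue  # not a daily Logstash index
    if day < cutoff:
        # Closing frees heap and file handles, though not disk space.
        requests.post("%s/%s/_close" % (ES, name))
```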

I also run the optimize command nightly on any Logstash index older than a couple of days.
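
For reference, a sketch of that nightly optimize call, assuming the ES 1.x _optimize endpoint and a hypothetical index name:

```python
# Sketch: force-merge ("optimize" in ES 1.x) a no-longer-written index down to
# one segment, which shrinks it on disk. The index name is hypothetical.
import requests

ES = "http://localhost:9200"
index = "logstash-2015.03.14"  # hypothetical

requests.post("%s/%s/_optimize?max_num_segments=1" % (ES, index))
```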

I'd just delete the really old data, but I have use cases for data up to 1.5 years old. Considering that snapshots of only a few months nearly used up all my space, and how much space a month of logs is currently taking up, I'm not sure how I can store that much data.

So, in general, how would you solve my problem? I need immediate access to one month's worth of logs (via Kibana), relatively quick access to up to 6 months of logs (open closed indices?), and temporary access to up to 1.5 years' worth (restore snapshots to a new cluster on my desktop?).

Would there be a way to move snapshots off of the NFS SAN to an external hard drive?

Should I tell Logstash to send logs to a text file that gets logrotated for a year and a half? Or does ES do a good enough job with compression that gzipping wouldn't help? If it were just a text file, I could unzip it, then tell Logstash to read the file into an ES cluster.
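
A rough sketch of that replay idea — streaming a gzipped, line-oriented log file into an index via the bulk API (file path, index, and type names are hypothetical):

```python
# Sketch: replay a gzipped log file into ES through the bulk API.
# Path, index, and type names are hypothetical.
import gzip
import json
import requests

ES = "http://localhost:9200"

def bulk_replay(path, index, batch=1000):
    buf = []
    with gzip.open(path, "rt") as fh:
        for line in fh:
            buf.append(json.dumps({"index": {"_index": index, "_type": "logs"}}))
            buf.append(json.dumps({"message": line.rstrip("\n")}))
            if len(buf) >= 2 * batch:  # flush every `batch` documents
                requests.post(ES + "/_bulk", data="\n".join(buf) + "\n")
                buf = []
    if buf:
        requests.post(ES + "/_bulk", data="\n".join(buf) + "\n")

bulk_replay("/archive/syslog-2014.07.gz", "logs-replay-2014.07")
```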

ES already compresses stored indices by default, right? So there's nothing I can do there?


Re: How do you run ES with limited data storage space?

Mark Walkom-2
There's not a lot you can do here unless you want to start uploading snapshots to S3, or something else that is not on your NAS.
ES does compress by default, and we are working on a better compression algorithm for future releases, which will help, but there's no ETA for that.
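
For the S3 route, a sketch of registering an S3 snapshot repository and snapshotting old indices into it (this requires the AWS cloud plugin on every node in ES 1.x; the bucket, region, and all names are assumptions):

```python
# Sketch: register an S3 snapshot repository, then snapshot a month of indices
# into it so they can be deleted locally. All names are hypothetical.
import json
import requests

ES = "http://localhost:9200"

repo = {"type": "s3",
        "settings": {"bucket": "my-es-snapshots",  # hypothetical bucket
                     "region": "us-east-1"}}
requests.put(ES + "/_snapshot/s3_backup", data=json.dumps(repo))

# Snapshot last month's indices; once the snapshot succeeds they can be
# deleted from the cluster to reclaim SAN space.
requests.put(ES + "/_snapshot/s3_backup/snap-2015.02",
             data=json.dumps({"indices": "logstash-2015.02.*"}))
```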

Re: How do you run ES with limited data storage space?

Aaron Mefford
While ES does compress by default, it also stores the data in search-oriented data structures that increase its size. The net effect is that your data will be much larger than the equivalent log file gzipped. However, running Logstash to ingest 1.5 years of logs may well take much longer than you would expect.

There is no reason you shouldn't be able to move snapshots off your shared drive onto an external drive or other storage, such as S3.
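
A sketch of a second "fs" repository pointed at a non-SAN mount (the path is an assumption; it must be mounted at the same location on every node, and newer releases also require it to be whitelisted in path.repo):

```python
# Sketch: an "fs" snapshot repository on a mount that is not the NFS SAN.
# The location is hypothetical.
import json
import requests

ES = "http://localhost:9200"

repo = {"type": "fs",
        "settings": {"location": "/mnt/external/es-archive",  # hypothetical
                     "compress": True}}
requests.put(ES + "/_snapshot/archive", data=json.dumps(repo))
```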

One thing you should reconsider is what you are trying to do with your resources; it sounds like it is simply too much. If the budget cannot budge to accommodate the requirements, then the requirements must budge to accommodate the budget.

Perhaps you can identify some log sources that do not have the same retention requirements, or some segment of your logs that is not as important. For instance, is it really important to keep that Java stack trace from a year ago? I don't know the nature of your logs, but I do know the nature of logs: there are important entries, and there are mundane, repetitive ones. What I am driving at is that by leveraging ES aliasing and cross-index searching, you can segment your logs into important and unimportant indices. You can still search across all the indices, but you can establish shorter retention policies for the less important ones while preserving your precious space for the important ones.
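
A sketch of that alias approach (all index and alias names are hypothetical, and the indices are assumed to already exist):

```python
# Sketch: two indices behind one alias, so searches see everything while
# retention can differ per index. All names are hypothetical.
import json
import requests

ES = "http://localhost:9200"

actions = {"actions": [
    {"add": {"index": "logs-important-2015.03", "alias": "logs-2015.03"}},
    {"add": {"index": "logs-mundane-2015.03", "alias": "logs-2015.03"}},
]}
requests.post(ES + "/_aliases", data=json.dumps(actions))

# Kibana and ad-hoc queries hit the alias and see both halves:
requests.get(ES + "/logs-2015.03/_search",
             data=json.dumps({"query": {"match": {"message": "error"}}}))

# A retention job can later drop the mundane half early:
requests.delete(ES + "/logs-mundane-2015.03")
```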

For some data you can take an RRD-style approach and create indices that hold summary information, which will let you generate historical dashboards that still capture the essence of the day, if not the detail. For instance, while you could not show the individual requests from a given day, you could still show the request volume over a three-year period.
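
A sketch of building such a summary with a date_histogram aggregation (index, type, and field names are assumptions):

```python
# Sketch: roll daily request counts out of the detailed indices into a tiny
# long-retention summary index. Index, type, and field names are assumptions.
import json
import requests

ES = "http://localhost:9200"

query = {"size": 0,
         "aggs": {"per_day": {"date_histogram": {"field": "@timestamp",
                                                 "interval": "day"}}}}
resp = requests.get(ES + "/logstash-2015.03.*/_search",
                    data=json.dumps(query)).json()

# One small document per day replaces millions of raw log entries.
for bucket in resp["aggregations"]["per_day"]["buckets"]:
    doc = {"day": bucket["key_as_string"], "requests": bucket["doc_count"]}
    requests.post(ES + "/logs-summary/daily", data=json.dumps(doc))
```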

While some of this goes against the grain of centralized logging, these are some of the ideas I had while reading about your situation.

Aaron

