JRE 1.7.0_11 / ES 1.0.1 - GC not collecting old gen / Memory Leak? (reposted with better formatting)


Gavin Seng

### JRE 1.7.0_11 / ES 1.0.1 - GC not collecting old gen / Memory Leak?

** Reposting because the 1st one came out without images and with all kinds of strange spaces.

Hi,

We're seeing issues where GC collects less and less memory over time leading to the need to restart our nodes.

The following is our setup and what we've tried. Please tell me if anything is lacking and I'll be glad to provide more details.

Also appreciate any advice on how we can improve our configurations.

### 32 GB heap

http://i.imgur.com/JNpWeTw.png


### 65 GB heap

http://i.imgur.com/qcLhC3M.png



### 65 GB heap with changed young/old ratio

http://i.imgur.com/Aa3fOMG.png


### Cluster Setup

* Tribes that link to 2 clusters
* Cluster 1
  * 3 masters (vms, master=true, data=false)
  * 2 hot nodes (physical, master=false, data=true)
    * 2 hourly indices (1 for syslog, 1 for application logs)
    * 1 replica
    * Each index ~ 2 million docs (6gb - excl. of replica)
    * Rolled to cold nodes after 48 hrs
  * 2 cold nodes (physical, master=false, data=true)
* Cluster 2
  * 3 masters (vms, master=true, data=false)
  * 2 hot nodes (physical, master=false, data=true)
    * 1 hourly index
    * 1 replica
    * Each index ~ 8 million docs (20gb - excl. of replica)
    * Rolled to cold nodes after 48 hrs
  * 2 cold nodes (physical, master=false, data=true)

Interestingly, we're actually having problems on Cluster 1's hot nodes even though it indexes less.

This suggests that the problem lies with searching, because Cluster 1 is searched a lot more.
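
For reference, rolling an index from the hot nodes to the cold nodes after 48 hours is typically done with shard allocation filtering on a node attribute. The sketch below only illustrates that mechanism; the box_type attribute and the index name are made-up examples, not necessarily what we use:

```
# Hypothetical sketch: nodes carry an attribute in elasticsearch.yml
# (e.g. node.box_type: hot or node.box_type: cold), and aged hourly
# indices are retagged so their shards relocate to the cold nodes.
curl -XPUT 'localhost:9200/syslog-2014.10.20.15/_settings' -d '{
  "index.routing.allocation.require.box_type": "cold"
}'
```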

### Machine settings (hot node)

* java
  * java version "1.7.0_11"
  * Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
  * Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
* 128gb ram
* 8 cores, 32 cpus
* ssds (raid 0)

### JVM settings

```
java
-Xms96g -Xmx96g -Xss256k
-Djava.awt.headless=true
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log -XX:+HeapDumpOnOutOfMemoryError
-verbose:gc -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
-Xloggc:[...]
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=[...]
-Dcom.sun.management.jmxremote.ssl=[...] -Dcom.sun.management.jmxremote.authenticate=[...]
-Dcom.sun.management.jmxremote.port=[...]
-Delasticsearch -Des.pidfile=[...]
-Des.path.home=/usr/share/elasticsearch -cp :/usr/share/elasticsearch/lib/elasticsearch-1.0.1.jar:/usr/share/elasticsearch/lib/*:/usr/share/elasticsearch/lib/sigar/*
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=[...]
-Des.default.path.data=[...]
-Des.default.path.work=[...]
-Des.default.path.conf=/etc/elasticsearch org.elasticsearch.bootstrap.Elasticsearch
```

### Key elasticsearch.yml settings

* threadpool.bulk.type: fixed
* threadpool.bulk.queue_size: 1000
* indices.memory.index_buffer_size: 30%
* index.translog.flush_threshold_ops: 50000
* indices.fielddata.cache.size: 30%
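
As a quick sanity check against the fielddata cache size above, per-node (and per-field) fielddata usage can be pulled from the _cat API; the field names below are just examples:

```
# Total fielddata memory held per node
curl 'localhost:9200/_cat/fielddata?v'

# Per-field breakdown for a couple of example fields
curl 'localhost:9200/_cat/fielddata/host,level?v'
```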


### Search Load (Cluster 1)

* Mainly Kibana3 (queries ES with daily alias that expands to 24 hourly indices)
* Jenkins jobs that run constantly and do many faceting/aggregation queries over the last hour of data
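
For a sense of what those queries look like, here is a hedged sketch of a last-hour facet query against the daily alias (the alias and field names are examples; on 1.0 these are facets rather than aggregations):

```
# Hypothetical Kibana3/Jenkins-style query: filter to the last hour of the
# daily alias and bucket hits per minute with a date_histogram facet.
curl -XGET 'localhost:9200/logstash-2014.10.20/_search' -d '{
  "size": 0,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "range": { "@timestamp": { "gte": "now-1h" } } }
    }
  },
  "facets": {
    "events_per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "minute" }
    }
  }
}'
```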

### Things we've tried (unsuccessfully)

* GC settings
  * Young/old ratio
    * Set the young/old ratio to 50/50, hoping that objects would be GCed before having the chance to move to old.
    * The old gen grew at a slower rate, but things still could not be collected.
  * Survivor space ratio
    * Gave survivor space a higher ratio of the young gen.
    * Increased the tenuring threshold (number of young collections survived before promotion to old) to 10 (up from 6).
  * Lower CMS occupancy fraction
    * Tried 60%, hoping to kick off GC earlier. GC kicked in earlier but still could not collect.
* Limit filter/field cache
  * indices.fielddata.cache.size: 32GB
  * indices.cache.filter.size: 4GB
* Optimizing each index down to 1 segment on the 3rd hour (see the sketch after this list)
* Limit JVM to 32 gb ram
  * reference: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html
* Limit JVM to 65 gb ram
  * This fulfils the 'leave 50% to the os' principle.
* Read 90.5/7 OOM errors-- memory leak or GC problems?
  * https://groups.google.com/forum/?fromgroups#!searchin/elasticsearch/memory$20leak/elasticsearch/_Zve60xOh_E/N13tlXgkUAwJ
  * But we're not using term filters
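
The optimize step mentioned in the list is roughly the following call (the index name is an example):

```
# Merge a two-hour-old hourly index down to a single segment
# (_optimize is the pre-2.x name for force merge).
curl -XPOST 'localhost:9200/applogs-2014.10.20.15/_optimize?max_num_segments=1'
```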


Re: JRE 1.7.0_11 / ES 1.0.1 - GC not collecting old gen / Memory Leak? (reposted with better formatting)

Gavin Seng

Thanks Adrien, my cache is exactly 32GB so I'm cautiously optimistic ... will try it out and report back!

From Adrien Grand:
You might be hit by the following Guava bug: https://github.com/elasticsearch/elasticsearch/issues/6268. It was fixed in Elasticsearch 1.1.3/1.2.1/1.3.0



Re: JRE 1.7.0_11 / ES 1.0.1 - GC not collecting old gen / Memory Leak? (reposted with better formatting)

Gavin Seng

Actually, now that I've read the bug a little more carefully, I'm not so optimistic.

* The cache here (https://github.com/elasticsearch/elasticsearch/issues/6268) is the filter cache, and mine was only set at 8 gb.
* Maybe fielddata is a Guava cache ... but I did set it to 30% for a run with a 96gb heap - so the fielddata cache is 28.8gb (< 32 gb).

Nonetheless, I'm trying a run now with an explicit 31gb of fielddata cache and will report back.

### 96 gb heap with 30% fielddata cache and 8gb filter cache

http://i.imgur.com/FMp49ZZ.png




Re: JRE 1.7.0_11 / ES 1.0.1 - GC not collecting old gen / Memory Leak? (reposted with better formatting)

Adrien Grand
Gavin,

Can you look at the stats APIs to see what they report regarding memory? For instance the following call to the _cat API would return memory usage for fielddata, filter cache, segments, the index writer and the version map:

  curl -XGET 'localhost:9200/_cat/nodes?v&h=v,j,hm,fm,fcm,sm,siwm,svmm'
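
A rough legend for those column abbreviations (inferred from the description above and from the output later in the thread):

```
#   v    -> elasticsearch version     j    -> JDK version
#   hm   -> heap max                  fm   -> fielddata memory
#   fcm  -> filter cache memory       sm   -> segments memory
#   siwm -> index writer memory       svmm -> version map memory
```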



--
Adrien Grand


Re: JRE 1.7.0_11 / ES 1.0.1 - GC not collecting old gen / Memory Leak? (reposted with better formatting)

Gavin Seng

Hi Adrien,

Unfortunately, explicitly setting it to 31GB did not work.

These are the stats @ 1700 (it's been running from 2300 the previous day to 1700):

v     j              hm      fm   fcm    sm 
1.0.1 1.7.0_11   23.8gb      0b    0b    0b 
1.0.1 1.7.0_11    1.9gb      0b    0b    0b 
1.0.1 1.7.0_11   23.9gb 243.8mb 1.3gb   5gb 
1.0.1 1.7.0_11   11.9gb      0b    0b    0b 
1.0.1 1.7.0_11 1007.3mb      0b    0b    0b 
1.0.1 1.7.0_11    7.8gb      0b    0b    0b 
1.0.1 1.7.0_11 1007.3mb      0b    0b    0b 
1.0.1 1.7.0_11   23.9gb  39.5mb 2.9gb 5.1gb 
1.0.1 1.7.0_11    1.9gb      0b    0b    0b 
1.0.1 1.7.0_11   11.6gb      0b    0b    0b 
1.0.1 1.7.0_11 1007.3mb      0b    0b    0b 
1.0.1 1.7.0_11   23.8gb      0b    0b    0b 
1.0.1 1.7.0_11    1.9gb      0b    0b    0b 
1.0.1 1.7.0_11 1007.3mb      0b    0b    0b 
1.0.1 1.7.0_11   95.8gb  11.6gb 7.9gb 1.6gb 
1.0.1 1.7.0_11   95.8gb  10.5gb 7.9gb 1.6gb 

The last 2 rows are our hot nodes.

### Heap from 1600 - 1700

http://i.imgur.com/GJnRmhw.jpg




### Heap as % of total heap size

http://i.imgur.com/CkC6P7K.jpg


### Heap as % (from 2300)

http://i.imgur.com/GFQSK8R.jpg







Re: JRE 1.7.0_11 / ES 1.0.1 - GC not collecting old gen / Memory Leak? (reposted with better formatting)

Gavin Seng

After doing the following, we've managed to stabilize the nodes.

* Decrease the number of shards to 1 (from the default 5). This was done knowing that our machines would always be able to handle a single shard. Also, we believe that the memory is being consumed by searching, and that this is somehow linked to the number of shards (Lucene indices). We had run a daily instead of hourly configuration before and did not face this issue (we faced others, though ...), and one thing hourly indices do is increase the number of shards/segments by a factor of 24. (A template sketch follows this list.)
* Increase JVM RAM usage to 75% of available RAM (up from the recommended 50%). We'd rather sacrifice the OS file cache than a JVM that OOMs.
* Decrease indices.fielddata.cache.size to 31GB (from 32GB). This follows Adrien's advice (thanks!), since we're on 1.0.1, which has the Guava bug.
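
A sketch of how the one-shard change can be applied to new hourly indices via an index template (the template name and index pattern are made-up examples, not our actual naming):

```
# Hypothetical index template so new hourly indices get 1 shard + 1 replica.
curl -XPUT 'localhost:9200/_template/hourly_logs' -d '{
  "template": "applogs-*",
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}'
```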

### Heap graphs from a Monday, a regular traffic day that the previous configuration would not have been able to handle

http://i.imgur.com/JFCxROd.png


http://i.imgur.com/Q2BSmsM.png





