Memory not released upon shutdown


Ivan Brusic
Question for the memory gurus. Potentially not an ElasticSearch/Lucene issue, but hopefully there are some settings to tweak to help out.

Running ES 0.20, default NIO filesystem, on a CentOS VM with 48 gigs. Could not start up ES with 24 gigs of heap and mlockall enabled. Disabling mlockall, with the same heap size, made ES work again. free shows 25 gigs in use with ES not running, and top shows no process with any significant memory utilization. I do not have stats for the VM after a clean reboot. Is there something within Lucene that grabs memory and never releases it? I am used to profiling memory usage for running Java processes, but am clueless when it comes to processes that are not running!

Cheers,

Ivan


Re: Memory not released upon shutdown

joergprante@gmail.com
Hi Ivan,

Are you on Linux? Have you checked ulimit -l? That is the maximum amount of memory, in KB, that mlockall() is allowed to lock. You may need to raise it in /etc/security/limits.conf (followed by a re-login).
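
For example, something along these lines in /etc/security/limits.conf -- a minimal sketch, assuming the node runs as a dedicated "elasticsearch" user (adjust the user name to your setup):

    # /etc/security/limits.conf -- allow the ES user to lock memory for mlockall()
    elasticsearch  soft  memlock  unlimited
    elasticsearch  hard  memlock  unlimited

    # verify after re-login; the value is reported in KB, or "unlimited"
    ulimit -l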

Cheers,

Jörg

Re: Memory not released upon shutdown

Igor Motov-3
What does slabtop report in the "Active / Total Size" line?
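
For reference, one non-interactive way to grab just that line, assuming a slabtop build that supports the -o/--once flag:

    # print a single snapshot of slab usage and pull out the summary line
    slabtop -o | grep 'Active / Total Size'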

Re: Memory not released upon shutdown

Ivan Brusic
In reply to this post by joergprante@gmail.com
Responses inline.
On Tue, Nov 13, 2012 at 3:11 AM, Jörg Prante <[hidden email]> wrote:
Hi Ivan,

Are you on Linux? Have you checked ulimit -l? That is the maximum amount of memory, in KB, that mlockall() is allowed to lock. You may need to raise it in /etc/security/limits.conf (followed by a re-login).

Yes, I should have mentioned that the ulimits are set correctly. Setting the heap to 20 GB with mlockall works, since there is enough memory available, but the system has a lot more.

 
What does slabtop report in the "Active / Total Size" line?

All measurements are on a VM that has been rebooted (with ES started from rc.d) but has not seen any queries. Heap is set to 24 GB, with mlockall enabled.

$ free -m
             total       used       free     shared    buffers     cached
Mem:         48264      25703      22561          0         70        188

 Active / Total Objects (% used)    : 963790 / 971343 (99.2%)
 Active / Total Slabs (% used)      : 15839 / 15848 (99.9%)
 Active / Total Caches (% used)     : 105 / 185 (56.8%)
 Active / Total Size (% used)       : 63254.20K / 64421.21K (98.2%)
 Minimum / Average / Maximum Object : 0.02K / 0.07K / 4096.00K

$ /etc/init.d/elasticsearch stop
Stopping ElasticSearch...
Stopped ElasticSearch.


$ free -m
             total       used       free     shared    buffers     cached
Mem:         48264        950      47314          0         70        188

 Active / Total Objects (% used)    : 959518 / 969149 (99.0%)
 Active / Total Slabs (% used)      : 15576 / 15592 (99.9%)
 Active / Total Caches (% used)     : 102 / 185 (55.1%)
 Active / Total Size (% used)       : 61793.27K / 63453.22K (97.4%)
 Minimum / Average / Maximum Object : 0.02K / 0.07K / 4096.00K
 

Re: Memory not released upon shutdown

Ivan Brusic
The numbers below are for a different node than the one I first reported on. I assumed the problem would be generic, but it might be specific to a single node (for now).

free still shows the memory in use even though ES has been stopped, and the service does not come back up.

$ free -m
             total       used       free     shared    buffers     cached
Mem:         48265      46150       2114          0         96        188

$ /etc/init.d/elasticsearch stop
Stopping ElasticSearch...
Stopped ElasticSearch.

$ free -m
             total       used       free     shared    buffers     cached
Mem:         48265      25470      22794          0         96        188

 Active / Total Objects (% used)    : 965687 / 976712 (98.9%)
 Active / Total Slabs (% used)      : 15743 / 15746 (100.0%)
 Active / Total Caches (% used)     : 101 / 185 (54.6%)
 Active / Total Size (% used)       : 62337.81K / 63982.26K (97.4%)
 Minimum / Average / Maximum Object : 0.02K / 0.07K / 4096.00K

$ /etc/init.d/elasticsearch start
Starting ElasticSearch...
Waiting for ElasticSearch..................................................................
running: PID:4834

$ tail /opt/elasticsearch/logs/service.log 
STATUS | wrapper  | 2012/11/13 07:53:54 | JVM process is gone.
ERROR  | wrapper  | 2012/11/13 07:53:54 | JVM exited unexpectedly.
STATUS | wrapper  | 2012/11/13 07:53:59 | Launching a JVM...
INFO   | jvm 5    | 2012/11/13 07:54:01 | WrapperManager: Initializing...
STATUS | wrapper  | 2012/11/13 07:54:19 | JVM received a signal SIGKILL (9).
STATUS | wrapper  | 2012/11/13 07:54:19 | JVM process is gone.
ERROR  | wrapper  | 2012/11/13 07:54:19 | JVM exited unexpectedly.
FATAL  | wrapper  | 2012/11/13 07:54:19 | There were 5 failed launches in a row, each lasting less than 300 seconds.  Giving up.
FATAL  | wrapper  | 2012/11/13 07:54:19 |   There may be a configuration problem: please check the logs.
STATUS | wrapper  | 2012/11/13 07:54:19 | <-- Wrapper Stopped

Probably a VM tuning issue.

Cheers,

Ivan

Re: Memory not released upon shutdown

joergprante@gmail.com
Hi Ivan,

Depending on the underlying OS memory organization, JVM initialization tries to be smart and allocates the initial heap in several steps, up to the size given in -Xms. mlockall(), on the other hand, is a single call via JNA, and it is not so smart. That is most likely why you observe mlockall() failures before the -Xms heap allocation fails.

Since the standard JVM cannot handle large heaps without stalls of seconds or even minutes, you should reconsider your requirements. Extra-large heaps do not give extra-large performance; quite the contrary, they are bad for performance. 24 GB is too much for the current standard JVM to handle. You will get better and more predictable performance with heaps of 4-8 GB, because the CMS garbage collector is tuned to perform well in that range. See also http://openjdk.java.net/jeps/144 for an enhancement proposal to create a better, scalable GC for machines with large RAM.

You might also be interested in activating the G1 garbage collector on the Oracle Java 7 JVM: http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html
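
If you want to experiment, the G1 switches themselves are just JVM options. A minimal sketch -- where exactly they go depends on how you launch ES (e.g. the service wrapper configuration or bin/elasticsearch.in.sh), so treat the variable below as illustrative:

    # enable G1 on an Oracle/OpenJDK 7 JVM, with a soft pause-time goal
    export JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
    java $JAVA_OPTS -version   # quick check that the JVM accepts the flags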

Cheers,

Jörg

Re: Memory not released upon shutdown

Ivan Brusic
Thanks Jörg.

I completely understand why the JVM refuses to start with mlockall; the question is why there is not enough free memory to begin with.

The difference between the nodes after ES has stopped:

Mem:         48264        950      47314          0         70        188
Mem:         48265      25470      22794          0         96        188

The latter node never releases the memory allocated to it. I will be upgrading to JDK 7 shortly, since there are various new GC options I want to try out, but I would like to start with a clean slate and would love to resolve the memory issue first.

Ivan

Re: Memory not released upon shutdown

kimchy
Administrator
Hi, a few notes here:

1. The main reason mlockall is there is to make sure the memory (ES_HEAP_SIZE) allocated to the elasticsearch Java process will not be swapped. You can achieve that by other means, like setting swappiness (a rough sketch of these settings follows this list). The reason you don't want a Java process to swap is the way the garbage collector works: it has to touch different parts of the process memory, which causes a lot of pages to be swapped in and out.
2. It's perfectly fine to run elasticsearch with 24gb of memory, and even more; you won't observe large pauses. We work hard in elasticsearch to play nicely with the garbage collector and eliminate those pauses. Many users run elasticsearch with 30gb of memory in production.
3. The more memory the Java process has, the more can be used for things like the filter cache (which automatically uses 20% of the heap by default) and other memory-based constructs. Leaving memory to the OS is also important so the OS file system cache can do its magic. We usually recommend allocating around 50% of the machine's memory to the Java process, but prefer not to go above 30gb (so the JVM can be smart and compress pointer sizes).
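
A rough sketch of the knobs from points 1 and 3, using the names current around ES 0.20 (bootstrap.mlockall, ES_HEAP_SIZE); double-check them against your version's documentation:

    # 1. reduce the kernel's tendency to swap (persist the setting in /etc/sysctl.conf)
    sudo sysctl -w vm.swappiness=1

    # ...or lock the heap in RAM instead, via elasticsearch.yml:
    #     bootstrap.mlockall: true

    # 3. give the JVM roughly half of the machine's RAM, staying under ~30gb
    export ES_HEAP_SIZE=24g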

Regarding memory not being released, that's strange. Can you double-check that there isn't a process still running? Once the process no longer exists, it cannot hold on to that memory.
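
A quick way to double-check for a leftover process, for example:

    # any elasticsearch JVMs still alive? (the [e] trick keeps grep from matching itself)
    pgrep -fl elasticsearch
    ps aux | grep [e]lasticsearch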

Re: Memory not released upon shutdown

mrflip
Necro'ing the thread to say we may be seeing a version of this. We have a uniform cluster of eight machines that each run two systems: a transfer-only elasticsearch node (no data, no master, no http) with a 1GB heap and mlockall=true; and a Storm+Trident topology that reads and writes several thousand records per second in batch requests using the Java Client API. On all the machines, over the course of a couple of weeks -- untouched, in steady state -- the memory usage of the processes does not change, but the amount of free RAM reported on the machine does.

The machine claims (see `free -m`) to be using 5.7 GB out of 6.9 GB of RAM, not counting the OS buffers+caches. Yet the `ps aux` output shows that active processes account for only about 2.5 GB -- there are 3+ GB missing. meminfo shows about 2.5 GB of slab cache, almost entirely consumed (says slabtop) by dentries: 605k slabs holding 2.5 GB of RAM across 12M objects.

I can't say for sure whether this is a Storm thing or an ES thing, but it's pretty clear that something is presenting Linux with an infinitely fascinating number of ephemeral directories to cache. Does that sound like anything ES/Lucene could produce? Given that it takes a couple of weeks to create the problem, we're unlikely to be able to do experiments. (We are going to increase the `vfs_cache_pressure` value to 10000 and otherwise just keep a close eye on things.)

___________________

In case anyone else hits this, here are some relevant things to google for (proceed at your own risk):

* Briefly exerting some memory pressure on one of these nodes (`sort -S 500M`) made it reclaim some of the slab cache -- its population declined to what you see below. My understanding is that the system will reclaim data from the slab cache exactly as needed. (Basically: this is not an implementation bug in whatever is producing the large slab occupancy; it's a UX bug in that htop, free, and our monitoring tool don't count it under bufs+caches.) It at least makes monitoring a pain.

* [`vfs_cache_pressure`](https://www.kernel.org/doc/Documentation/sysctl/vm.txt): "Controls the tendency of the kernel to reclaim the memory which is used for
caching of directory and inode objects. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes."
* From [this SO thread](http://stackoverflow.com/questions/5463800/linux-memory-reporting-discrepancy): "If the slab cache is responsible for a large portion of your "missing memory", check /proc/slabinfo to see where it's gone. If it's dentries or inodes, you can use `sudo bash -c 'sync ; echo 2 > /proc/sys/vm/drop_caches'` to get rid of them" -- a combined sketch of both knobs follows below.
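
A combined sketch of those two knobs (values are illustrative, not recommendations):

    # bias the kernel toward reclaiming dentries/inodes sooner
    sudo sysctl -w vm.vfs_cache_pressure=10000
    echo 'vm.vfs_cache_pressure = 10000' | sudo tee -a /etc/sysctl.conf   # persist across reboots

    # one-off: drop the reclaimable dentry and inode caches
    sudo bash -c 'sync ; echo 2 > /proc/sys/vm/drop_caches'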

___________________


free -m

                 total       used       free     shared    buffers     cached
    Mem:          6948       5725       1222          0        268        462
    -/+ buffers/cache:       4994       1953
    Swap:            0          0          0

ps aux | sort -rnk6 | head -n 20 | cut -c 1-100

    USER       PID %CPU %MEM     VSZ    RSS TTY    STAT START TIME    COMMAND
    61021     5170  9.9 13.5 5449368 960588 ?      Sl   Jun28 2890:23 java (elasticsearch)
    storm    22628 41.2  9.1 4477532 653556 ?      Sl   Jul01 9775:58 java (trident state)
    storm    22623  6.0  1.8 3212816 133268 ?      Sl   Jul01 1438:13 java (trident wu)
    storm    22621  6.0  1.8 3212816 129300 ?      Sl   Jul01 1423:30 java (trident wu)
    storm    22625  6.1  1.8 3212816 128320 ?      Sl   Jul01 1450:38 java (trident wu)
    storm    22631  6.2  1.7 3212816 125740 ?      Sl   Jul01 1481:30 java (trident wu)
    storm     5629  0.4  1.6 3576976 114916 ?      Sl   Jun28  140:35 java (storm supervisor)
    storm    22814 23.5  0.4  116240  34584 ?      Sl   Jul01 5577:39 ruby (wu)
    storm    22822 23.4  0.4  116204  34548 ?      Sl   Jul01 5552:50 ruby (wu)
    storm    22806 23.4  0.4  116200  34544 ?      Sl   Jul01 5554:17 ruby (wu)
    storm    22830 23.3  0.4  116180  34524 ?      Sl   Jul01 5534:38 ruby (wu)
    flip      7928  0.0  0.1   25352   7900 pts/4  Ss   06:31    0:00 -bash
    flip     10268  0.0  0.0   25352   6548 pts/4  S+   06:51    0:00 -bash
    syslog     718  0.0  0.0  254488   5024 ?      Sl   Apr05   15:30 rsyslogd -c5
    root      7725  0.0  0.0   73360   3576 ?      Ss   06:31    0:00 sshd: flip [priv]
    flip      7927  0.0  0.0   73360   1676 ?      S    06:31    0:00 sshd: flip@pts/4
    whoopsie   836  0.0  0.0  187588   1628 ?      Ssl  Apr05    0:00 whoopsie
    root         1  0.0  0.0   24460   1476 ?      Ss   Apr05    0:57 /sbin/init
    flip     10272  0.0  0.0   16884   1260 pts/4  R+   06:51    0:00 /bin/ps aux

slabtop

     Active / Total Objects (% used)    : 12069032 / 13126009 (91.9%)
     Active / Total Slabs (% used)      : 615122 / 615122 (100.0%)
     Active / Total Caches (% used)     : 68 / 106 (64.2%)
     Active / Total Size (% used)       : 2270155.02K / 2467052.45K (92.0%)
     Minimum / Average / Maximum Object : 0.01K / 0.19K / 8.00K

        OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
    12720456 11688175  91%    0.19K 605736       21   2422944K dentry
      182091   163690  89%    0.10K   4669       39     18676K buffer_head
       22496    22405  99%    0.86K    608       37     19456K ext4_inode_cache
       21760    21760 100%    0.02K     85      256       340K ext4_io_page
       21504    21504 100%    0.01K     42      512       168K kmalloc-8
       17680    16830  95%    0.02K    104      170       416K numa_policy
       11475     9558  83%    0.05K    135       85       540K shared_policy_node
       ...

dentry-state

    sudo cat /proc/sys/fs/dentry-state
    11688070        11677721        45      0       0       0

see http://linux.about.com/library/cmd/blcmdl5_slabinfo.htm
sudo cat /proc/slabinfo | sort -rnk2 | head

    dentry            11689648 12720456    192   21    1 : tunables 0 0 0 : slabdata 605736 605736 0
    buffer_head         163690   182091    104   39    1 : tunables 0 0 0 : slabdata   4669   4669 0
    ext4_inode_cache     22405    22496    880   37    8 : tunables 0 0 0 : slabdata    608    608 0
    ext4_io_page         21760    21760     16  256    1 : tunables 0 0 0 : slabdata     85     85 0
    kmalloc-8            21504    21504      8  512    1 : tunables 0 0 0 : slabdata     42     42 0
    numa_policy          16830    17680     24  170    1 : tunables 0 0 0 : slabdata    104    104 0
    sysfs_dir_cache      11396    11396    144   28    1 : tunables 0 0 0 : slabdata    407    407 0
    kmalloc-64           11072    11072     64   64    1 : tunables 0 0 0 : slabdata    173    173 0
    kmalloc-32            9344     9344     32  128    1 : tunables 0 0 0 : slabdata     73     73 0


sudo cat /proc/meminfo

    MemTotal:        7114792 kB
    MemFree:         1443160 kB
    Buffers:          275232 kB
    Cached:           446828 kB
    SwapCached:            0 kB
    Active:          2810096 kB
    Inactive:         240064 kB
    Active(anon):    2299088 kB
    Inactive(anon):      720 kB
    Active(file):     511008 kB
    Inactive(file):   239344 kB
    Unevictable:           0 kB
    Mlocked:               0 kB
    SwapTotal:             0 kB
    SwapFree:              0 kB
    Dirty:               260 kB
    Writeback:             0 kB
    AnonPages:       2299184 kB
    Mapped:            27944 kB
    Shmem:               772 kB
    Slab:            2506124 kB
    SReclaimable:    2479280 kB
    SUnreclaim:        26844 kB
    KernelStack:        3512 kB
    PageTables:        12968 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:     7114792 kB
    Committed_AS:    2626600 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:       26116 kB
    VmallocChunk:   34359710188 kB
    HardwareCorrupted:     0 kB
    AnonHugePages:         0 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:     7348224 kB
    DirectMap2M:           0 kB


Re: Memory not released upon shutdown

Igor Motov-3
I saw something like this about 1.5 years ago with an old version of Ubuntu (hence my question about slabtop). It went away after upgrading the kernel to the latest version. Which kernel version are you running?

Re: Memory not released upon shutdown

joergprante@gmail.com
In reply to this post by mrflip
On 18.07.13 09:50, Philip (Flip) Kromer wrote:

> Necro'ing the thread to say we may be seeing a version of this. We
> have a uniform cluster of eight machines that run two systems: a
> transfer-only elasticsearch node (no data, no master and no http),
> with 1GB heap and mlockall=true; and a Storm+Trident topology that
> reads and writes several thousand records per second in batch requests
> using the Java Client API. On all the machines, over the course of a
> couple weeks -- untouched, in steady state -- the memory usage of the
> processes does not change, but the amount of free ram reported on the
> machine does.
>
> The machine claims (see `free -m`) to be using 5.7 GB out of 6.9GB
> ram, not counting the OS buffers+caches. Yet the `ps aux` output shows
> the amount of ram taken by active processes is only about 2.5GB --
> there are 3+ missing GB of data. Meminfo shows that there is about
> 2.5GB of slab cache, and it is almost entirely consumed (says slabtop)
> by 'dentries': 605k slabs for 2.5GB ram on 12 M objects.
>
> I can't say for sure whether this is a Storm thing or an ES thing, but
> It's pretty clear that something is presenting Linux with an
> infinitely fascinating number of ephemeral directories to cache. Does
> that sound like anything ES/Lucene could produce?

Nothing like this is known; the dentry cache is under the control of the kernel.

> Given that it takes a couple weeks to create the problem, we're
> unlikely to be able to do experiments. (We are going to increase the
> `vfs_cache_pressure` value to 10000 and otherwise just keep a close
> eye on things).

Instead, you should upgrade the kernel; newer kernels bring better SELinux handling and improved extfs and memory management, all of which may have side effects on the dentry cache. Check whether you have custom IP monitoring / net traffic filtering kernel modules running, or custom SELinux settings. In general, there is not much to worry about until the kernel starts killing processes because of OOM.
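
A quick way to check the things mentioned above, assuming the usual tools are installed:

    uname -r                # running kernel version
    lsmod                   # loaded kernel modules -- look for custom filtering/monitoring ones
    sestatus 2>/dev/null    # SELinux status, if the SELinux tooling is present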

Jörg
