Server crashed or became unresponsive with memory (physical and swap) fully utilised

Environment

  • Red Hat Enterprise Linux (all versions)

Issue

  • Server crashed with all memory fully consumed
  • System becomes unresponsive and vmcore analysis suggests that the system ran out of memory
  • Server found in a hung/frozen state; sar shows memory fully utilized
  • Server rebooted due to OOM
  • Host unavailable due to Out Of Memory
  • System hangs and cannot be reached via ssh due to a memory issue

Resolution

  • Complete utilization of memory and swap can cause a system to hang due to the increased I/O workload, or can cause a crash when system processes are killed by the Out-Of-Memory (OOM) killer.
  • To resolve this issue, identify the reason for the excessive memory consumption and act accordingly. Some possible actions, depending on the culprit, are listed below; a short sketch of commands that can help identify the largest consumers follows the list.
    • Add more RAM to the system.
    • Increase the swap space available to the system, though this can cause a further performance impact due to frequent swapping in and out of data to disk.
    • Tune applications to use less memory where possible.
    • Reallocate the workload to another system.
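
  • To identify the largest memory consumers on a live system, commands such as the following can be used (a minimal sketch; the VmSwap field is only exposed by more recent kernels):

    # Overall memory and swap usage
    free -m

    # Top processes by resident set size (RSS)
    ps aux --sort=-rss | head

    # Per-process swap usage, largest first (kernels exposing VmSwap)
    grep VmSwap /proc/*/status 2>/dev/null | sort -t: -k3 -rn | head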

Root Cause

  • The system became unresponsive because physical memory and swap were exhausted, and a large number of processes were simultaneously trying to free memory.

Diagnostic Steps

  • Relevant sar performance data from the system is given below. It shows an increase in paging and swapping activity, with almost all of the physical memory and swap used up; the portion of memory used for buffers and cache is comparatively low. The commands used to collect these fields are sketched after the output.

    00:00:00 pswpin/s  pswpout/s
    11:40:59   244.37    3371.47
    
    00:00:00  pgpgin/s  pgpgout/s   fault/s majflt/s
    11:40:59   1047.18   13686.11  14256.36    32.10
    
    
    00:00:00  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbswpfree  kbswpused  %swpused  kbswpcad
    11:20:01   25136904    7694080     23.44     150976   5782872   16776792        160      0.00         0
    11:40:59     150428   32680556     99.54       2428     35976          0   16776952    100.00    243696
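
  • For reference, the fields above can be retrieved from the sysstat data files with commands along these lines (a sketch; exact options vary between sysstat versions, e.g. swap utilization moved from 'sar -r' to 'sar -S' in later releases; saDD is the data file for day DD of the month):

    # Swapping statistics (pswpin/s, pswpout/s)
    sar -W -f /var/log/sa/saDD

    # Paging statistics (pgpgin/s, pgpgout/s, fault/s, majflt/s)
    sar -B -f /var/log/sa/saDD

    # Memory (and, on older sysstat versions, swap) utilization
    sar -r -f /var/log/sa/saDD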
    
  • The vmcore analysis from a system that was manually crashed with 'sysrq+c' during a low memory period shows:

    crash> kmem -i
                  PAGES        TOTAL      PERCENTAGE
     TOTAL MEM  8207746      31.3 GB         ----
          FREE    37508     146.5 MB    0% of TOTAL MEM
          USED  8170238      31.2 GB   99% of TOTAL MEM
        SHARED     8811      34.4 MB    0% of TOTAL MEM
       BUFFERS      109       436 KB    0% of TOTAL MEM
        CACHED     3053      11.9 MB    0% of TOTAL MEM
          SLAB    15598      60.9 MB    0% of TOTAL MEM
    
    TOTAL HIGH        0            0    0% of TOTAL MEM
     FREE HIGH        0            0    0% of TOTAL HIGH
     TOTAL LOW  8207746      31.3 GB  100% of TOTAL MEM
      FREE LOW    37508     146.5 MB    0% of TOTAL LOW      <--- No free memory
    
    TOTAL SWAP  4194238        16 GB         ----
     SWAP USED  4194237        16 GB   99% of TOTAL SWAP          
     SWAP FREE        1         4 KB    0% of TOTAL SWAP     <--- No free swap
    
  • There's very little memory in shared, buffers, cached and slab, so the used memory is probably almost entirely made up of userspace pages, which would also explain why the swap was used up so easily. The largest consumers can be identified from the vmcore as shown below.
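
  • The processes holding the most memory can be listed by sorting crash's 'ps' output on the RSS column (a sketch, assuming RSS is the 8th column of the 'ps' output, as it is in current crash versions):

    crash> ps > ps.out
    crash> !sort -nk 8 ps.out | tail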

  • Load is very high at 205, but there are only 49 processes in the uninterruptible state, which suggests the rest are runnable processes that don't have a CPU to run on.

    crash> sys | grep LOAD
    LOAD AVERAGE: 205.62, 158.06, 104.87
    
  • There are 144 processes in the run queue:

    crash> ps | grep -c RU
    144
    
  • There are 56 processes running shrink_active_list to reclaim memory:

    crash> foreach RU bt > foreach_RU_bt.out
    crash> !grep -c shrink_active_list foreach_RU_bt.out
    56
    
  • There are a lot of processes in cond_resched(), which allows a process to yield the CPU but does not change the process's state from runnable.

    crash> !grep -c __cond_resched foreach_RU_bt.out
    125
    
  • There are 49 processes in the uninterruptible "D" state:

    crash> ps | grep -c UN
    49
    
  • There are 112 processes in an unknown state (i.e., crash cannot determine what state they are in):

    crash> ps | grep -c ??
    112
    
  • What are these processes doing? A representative backtrace shows:

    PID: 1      TASK: ffff81082ffb37a0  CPU: 4   COMMAND: "init"
     #0 [ffff81082ffb9868] schedule at ffffffff80063035
     #1 [ffff81082ffb9940] __cond_resched at ffffffff8008bb2c
     #2 [ffff81082ffb9950] cond_resched at ffffffff8006316e
     #3 [ffff81082ffb9960] shrink_inactive_list at ffffffff800c7246
     #4 [ffff81082ffb9b50] shrink_zone at ffffffff80012d02
     #5 [ffff81082ffb9b90] try_to_free_pages at ffffffff800c8028
     #6 [ffff81082ffb9c10] __alloc_pages at ffffffff8000f271
     #7 [ffff81082ffb9c70] __do_page_cache_readahead at ffffffff80012a62
     #8 [ffff81082ffb9d60] filemap_nopage at ffffffff80013401
     #9 [ffff81082ffb9dc0] __handle_mm_fault at ffffffff80008863
    #10 [ffff81082ffb9e60] do_page_fault at ffffffff80066b9a
    #11 [ffff81082ffb9f50] error_exit at ffffffff8005dde9
        RIP: 0000000000406f5d  RSP: 00007fff3876c7e0  RFLAGS: 00010206
        RAX: 00007fff3876c980  RBX: 0000000000000008  RCX: 0000000000000000
        RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007fff3876cb00
        RBP: 00007fff3876c980   R8: 0000000000000000   R9: 0000000000300000
        R10: 0000000000000000  R11: 00000033c107a5d3  R12: 0000000000006ffc
        R13: 000000000e2ed1bc  R14: 0000000000000000  R15: 0000000000407779
        ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
    
  • They are all in shrink_zone() trying to free up memory, and since their state never changes from runnable they are all contending for the same resources, which seems excessive.
    • CPUs 0, 1, 3, 4, 5, 6 and 7 are all stuck on the same spinlock, which is held by CPU 2:

    PID: 5769   TASK: ffff81062d1cc7a0  CPU: 0   COMMAND: "corpact"
     #0 [ffffffff80428f30] crash_nmi_callback at ffffffff80079db6
     #1 [ffffffff80428f40] do_nmi at ffffffff80065942
     #2 [ffffffff80428f50] nmi at ffffffff80064fa7
        [exception RIP: .text.lock.spinlock+5]
        RIP: ffffffff80064cb7  RSP: ffff810157ce1910  RFLAGS: 00000286
        RAX: 0000000000000000  RBX: ffff8107fa8ff2e8  RCX: ffff810108f27088
        RDX: ffff8107fa8ff2e9  RSI: 0000000000000001  RDI: ffff8107fa8ff2e8
        RBP: 0000000000000000   R8: ffff81000002b600   R9: ffff81000002b600
        R10: 0000000057ce1bb8  R11: 0000000000000286  R12: 00000000000002bc
        R13: ffff81000002b600  R14: 0000000000000000  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    --- <NMI exception stack> ---
     #3 [ffff810157ce1910] .text.lock.spinlock at ffffffff80064cb7 (via _spin_lock)
     #4 [ffff810157ce1910] page_lock_anon_vma at ffffffff80021a5f
     #5 [ffff810157ce1920] page_referenced at ffffffff8003ba76
     #6 [ffff810157ce1960] shrink_inactive_list at ffffffff800c72c3
     #7 [ffff810157ce1b50] shrink_zone at ffffffff80012d02
     #8 [ffff810157ce1b90] try_to_free_pages at ffffffff800c8028
     #9 [ffff810157ce1c10] __alloc_pages at ffffffff8000f271
    #10 [ffff810157ce1c70] __do_page_cache_readahead at ffffffff80012a62
    #11 [ffff810157ce1d60] filemap_nopage at ffffffff80013401
    #12 [ffff810157ce1dc0] __handle_mm_fault at ffffffff80008863
    #13 [ffff810157ce1e60] do_page_fault at ffffffff80066b9a
    #14 [ffff810157ce1f50] error_exit at ffffffff8005dde9
        RIP: 00002b60f6adde23  RSP: 00000000416a7660  RFLAGS: 00010246
        RAX: 000000000041c9b0  RBX: 0000000000000000  RCX: 000000001110a008
        RDX: 00000000ffffffff  RSI: 00000000ffffffff  RDI: 00007fffb4277db0
        RBP: 00000000416a7680   R8: 0000000011109810   R9: 0000000000000001
        R10: 00000033c1351a30  R11: 0000000000000202  R12: 00007fffb4277dc0
        R13: 00007fffb4277db0  R14: 00000000416a76e0  R15: 00000000000001e0
        ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
    
  • So this system is extremely busy but accomplishing very little, since only one CPU is really doing anything useful here.
    To see how many processes are performing reclaim in a given zone, inspect the zone's reclaim_in_progress counter:

    NODE: 0  ZONE: 2  ADDR: ffff81000001b600  NAME: "Normal"
      SIZE: 1310720  PRESENT: 1292800  MIN/LOW/HIGH: 1799/2248/2698
      NR_ACTIVE: 680955  NR_INACTIVE: 538921  FREE: 1770
    
    
    crash> p -d ((struct zone *)0xffff81000001b600)->reclaim_in_progress
    $6 = {
      counter = 268
    }
    

So in the above case it looks like 268 processes are in page reclaim for the Normal zone.

Below is another pattern for a similar memory issue.

  • No explicit OOM kill, but constant page allocation failures for order-0 (4 kB) requests. Commands for locating these messages on a live system are sketched after the output.

    [1080682.276174] pool-ds_am: page allocation failure: order:0, mode:0x400000(GFP_NOWAIT), nodemask=(null),cpuset=/,mems_allowed=0
    [1080682.281342] pool-ds_am: page allocation failure: order:0, mode:0x400000(GFP_NOWAIT), nodemask=(null),cpuset=/,mems_allowed=0
    [1340364.651811] pool-ds_am: page allocation failure: order:0, mode:0x400000(GFP_NOWAIT), nodemask=(null),cpuset=/,mems_allowed=0
    [1340364.999316] pool-ds_am: page allocation failure: order:0, mode:0x400000(GFP_NOWAIT), nodemask=(null),cpuset=/,mems_allowed=0
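
  • On a running system, these messages can be located in the kernel log, for example (a sketch):

    # Search the kernel ring buffer and the journal for allocation failures
    dmesg | grep -i "page allocation failure"
    journalctl -k | grep -i "page allocation failure"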
    
  • Many uninterruptible tasks blocked for short periods

    crash> ps -S
      RU: 2
      UN: 235  <--
      IN: 735
      ID: 68
      WA: 1
    
    crash> foreach UN ps -m|tail
    ...
    [ 0 00:00:02.181] [UN]  PID: 2837459  TASK: ffff8fe76dfb8000  CPU: 1    COMMAND: "ds_nuagent"
    [ 0 00:00:02.284] [UN]  PID: 2798     TASK: ffff8fe738d14000  CPU: 1    COMMAND: "ds_nuagent"
    [ 0 00:00:02.216] [UN]  PID: 2020     TASK: ffff8fe702230000  CPU: 0    COMMAND: "node /data/env-"
    [ 0 00:00:02.265] [UN]  PID: 121778   TASK: ffff8fe6c20b8000  CPU: 0    COMMAND: "ds_nuagent"
    [ 0 00:00:02.475] [UN]  PID: 4540     TASK: ffff8fe73583c000  CPU: 1    COMMAND: "ds_nuagent"
    
  • All of those tasks were blocked waiting for I/O to complete:

    crash> foreach UN bt|grep __lock_page_killable -c
    234
    
    PID: 4540     TASK: ffff8fe73583c000  CPU: 1    COMMAND: "ds_nuagent"
     #0 [ffffb71905f4bb48] __schedule at ffffffffb160e1f1
     #1 [ffffb71905f4bba0] schedule at ffffffffb160e7e5
     #2 [ffffb71905f4bbb8] io_schedule at ffffffffb160ec12
     #3 [ffffb71905f4bbc8] __lock_page_killable at ffffffffb0e9ed90
     #4 [ffffb71905f4bc60] filemap_fault at ffffffffb0ea19d1
     #5 [ffffb71905f4bd30] __xfs_filemap_fault at ffffffffc02716bd [xfs]
     #6 [ffffb71905f4bd80] __do_fault at ffffffffb0edc1e8
     #7 [ffffb71905f4bda0] handle_pte_fault at ffffffffb0ee16fd
     #8 [ffffb71905f4bde0] __handle_mm_fault at ffffffffb0ee4232
     #9 [ffffb71905f4be98] handle_mm_fault at ffffffffb0ee447a
    #10 [ffffb71905f4bed0] __do_page_fault at ffffffffb0c807c0
    #11 [ffffb71905f4bf20] do_page_fault at ffffffffb0c80a47
    #12 [ffffb71905f4bf50] page_fault at ffffffffb18011ae
        RIP: 00007fb2db958a17  RSP: 000000c0002dd1c8  RFLAGS: 00010202
        RAX: 000000000006cc80  RBX: 00000000fffffffa  RCX: 00007fb2dc914f00
        RDX: 0000000000000556  RSI: 0000000000004448  RDI: 00007fb2dc761500
        RBP: 000000c0002dd1d8   R8: 0000000000000557   R9: 00007fb2dc763fb8
        R10: 00000000001976e0  R11: 000000000000101f  R12: 000000c0002dd0c8
        R13: ffffffffffffffff  R14: 000000c0002d4000  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
    
  • The sshd process showed the same backtrace, which is why ssh connections were unavailable:

    PID: 928      TASK: ffff8fe6c2b7c000  CPU: 1    COMMAND: "sshd"
     #0 [ffffb7190331fb48] __schedule at ffffffffb160e1f1
     #1 [ffffb7190331fba0] schedule at ffffffffb160e7e5
     #2 [ffffb7190331fbb8] io_schedule at ffffffffb160ec12
     #3 [ffffb7190331fbc8] __lock_page_killable at ffffffffb0e9ed90
     #4 [ffffb7190331fc60] filemap_fault at ffffffffb0ea19d1
     #5 [ffffb7190331fd30] __xfs_filemap_fault at ffffffffc02716bd [xfs]
     #6 [ffffb7190331fd80] __do_fault at ffffffffb0edc1e8
     #7 [ffffb7190331fda0] handle_pte_fault at ffffffffb0ee16fd
     #8 [ffffb7190331fde0] __handle_mm_fault at ffffffffb0ee4232
     #9 [ffffb7190331fe98] handle_mm_fault at ffffffffb0ee447a
    #10 [ffffb7190331fed0] __do_page_fault at ffffffffb0c807c0
    #11 [ffffb7190331ff20] do_page_fault at ffffffffb0c80a47
    #12 [ffffb7190331ff50] page_fault at ffffffffb18011ae
        RIP: 00007fd97866eab0  RSP: 00007ffc4d5a2798  RFLAGS: 00010246
        RAX: 0000000000000000  RBX: 0000000000000030  RCX: 00007ffc4d5a284c
        RDX: 0000000000000036  RSI: 00007fd97883e70b  RDI: 0000000000000030
        RBP: 0000000000000000   R8: 00007fd978ab7d00   R9: 0000000000000000
        R10: 000000007fffffff  R11: 0000000000000246  R12: 000055bd56573580
        R13: 0000000000000010  R14: 000055bd56573580  R15: 00007ffc4d5a284c
        ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
    
  • This typically shows up together with busy disk I/O request queues (a live-system equivalent is sketched after the output):

    crash> dev -D
    MAJOR GENDISK            NAME       REQUEST_QUEUE      TOTAL ASYNC  SYNC
        8 ffff8fe6c8b8f000   sdc        ffff8fe6c8b97488     239   239     0
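
  • On a live system, the same disk pressure would typically be visible as high utilization and long request queues in iostat output (a sketch; iostat is provided by the sysstat package):

    # Extended per-device statistics, refreshed every second
    iostat -x 1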
    

