Backup server running RHEL 5 becomes unresponsive on regular base

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 5.9 (RHEL)

Issue

  • Our DL380 G6 regularly stops with hung task messages on the console.

Resolution

  • Decrease value of vm.dirty_ratio from 40 to 20 to initiate writeback sooner and to avoid congestion and potential stalls
    • Change the value in /etc/sysctl.conf to make the change permanent, using your favorite text editor and activate the change running sysctl -p
    • For testing purpose only, change the value using sysctl -w vm.dirty_ratio=20

Root Cause

  • Problem with writeback of dirty pages and be2net driver requiring memory to work correctly. Due to the fact, that the writeback needs to be done on NFS, it seems that the system is ending up in a deadlock situation as writeback needs network (i.e. be2net driver) but be2net drivers requires memory to keep working.

Diagnostic Steps

Configure the system to capture a kdump
  • Configure the system to use kdump and force a crash using SysRq when the system is in unresponsive state
Analyze vmcore
  • The vmcore shows that the system is very low on free memory. Most memory is in use for caching file data, and the majority of these pages are allocated to six files (NFS) only. These six files have been opened by bptm processes. The following table gives a summary (FD = File Descriptor).
    'bptm'  |  number  | file name and
   PID / FD | of pages | mount point
  ----------|----------|-----------------------------------------
            |          |
  12482  19 |  434437  | xy1234_1391574088_C2_F5.1391574088.img
            |          | /dstu_nfs_nf1/pav3 [*]
            |          |
   2392  19 |  194792  | yz2412_1391578204_C1_F1.1391578204.img
   2399  19 |          | /dstu_nfs_nf2/pav4 [**]
            |          |
   2389  19 |  192652  | yz2312_1391578201_C1_F1.1391578201.img
   2396  19 |          | /dstu_nfs_nf2/pav4 [**]
            |          |
   2385  19 |  158263  | yz2512_1391578200_C1_F1.1391578200.img
   2388  19 |          | /dstu_nfs_nf2/pav4 [**]
            |          |
   2381  19 |   85782  | yz9221_1391578202_C1_F1.1391578202.img
   2384  19 |          | /dstu_nfs_nf2/pav4 [**]
            |          |
   2393  19 |   85775  | yz6221_1391578205_C1_F1.1391578205.img
   2400  19 |          | /dstu_nfs_nf2/pav4 [**]
            |          |
  ----------|----------|-----------------------------------------
  total pages 1151701
                          [*] nf1:/vol/nbu_pav3_binz_srv1234/q1
                         [**] nf2:/vol/nbu_pav4_binz_srv1234/q1
  • The bptm processes (PIDs) 12482, 2392, 2389, 2385, and 2393 are executing a write() system call and have the following kernel function call sequence in their stack traces.
...
  nfs_write_begin
    grab_cache_page_write_begin
      __alloc_pages
        try_to_free_pages
          shrink_zone
            throttle_vm_writeout
              blk_congestion_wait
                io_schedule_timeout
                  schedule_timeout
                    schedule
  • It appears that these threads are executing a loop in throttle_vm_writeout().
        for ( ; ; ) {
                get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
                ...
                dirty_thresh += dirty_thresh / 10;      /* wheeee... */

                if (global_page_state(NR_UNSTABLE_NFS) +
                        global_page_state(NR_WRITEBACK) <= dirty_thresh)
                                break;
                blk_congestion_wait(WRITE, HZ/10);
        }
  • Interestingly, the kernel events/2 thread appears to be executing the same loop too. This thread performs a callback of the be_worker() function which belongs to the be2net network driver. The driver - in turn - attempts to allocate memory and is stuck here.
  PID: 28     TASK: ffff81019743d080  CPU: 2   COMMAND: "events/2"           
   #0 [ffff810197461a80] schedule at ffffffff80062fa0                        
   #1 [ffff810197461b58] schedule_timeout at ffffffff8006389f                
   #2 [ffff810197461ba8] io_schedule_timeout at ffffffff80063241             
   #3 [ffff810197461bd8] blk_congestion_wait at ffffffff8003b871             
   #4 [ffff810197461c28] throttle_vm_writeout at ffffffff80053be0            
   #5 [ffff810197461c48] shrink_zone at ffffffff8001331f                     
   #6 [ffff810197461c88] try_to_free_pages at ffffffff800cfd72               
   #7 [ffff810197461d18] __alloc_pages at ffffffff8000f604                   
   #8 [ffff810197461d88] alloc_page_interleave at ffffffff80023d1a           
   #9 [ffff810197461da8] be_post_rx_frags at ffffffff882fa579 [be2net]       
  #10 [ffff810197461df8] be_worker at ffffffff882fe1a6 [be2net] <<-- be2net, be_worker function
  #11 [ffff810197461e38] run_workqueue at ffffffff8004d903
  #12 [ffff810197461e78] worker_thread at ffffffff8004a23c
  #13 [ffff810197461ee8] kthread at ffffffff80032c2b
  #14 [ffff810197461f48] kernel_thread at ffffffff8005dfc1
  • According this information, it seems that the system is either in a kind of deadlock situation, or in a dependency situation where NFS writeback is extremely slow (such that the system appears to be unresponsive).
    • Writeback of dirty pages must occur to free and reclaim memory. However, writeback depends on the be2net driver because the page contents need to be transferred to the NFS server.
    • The be2net driver depends on free memory. The fact that the driver is waiting for free memory may adversely impact the communication to the NFS server (and thus impact the ongoing writeback as well).

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments