1. 26 Oct, 2018 1 commit
    • mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · 64081362
      Dave Chinner authored
      We've recently seen a workload on XFS filesystems with a repeatable
      deadlock between background writeback and a multi-process application
      doing concurrent writes and fsyncs to a small range of a file.
      
      range_cyclic
      writeback		Process 1		Process 2
      
      xfs_vm_writepages
        write_cache_pages
          writeback_index = 2
          cycled = 0
          ....
          find page 2 dirty
          lock Page 2
          ->writepage
            page 2 writeback
            page 2 clean
            page 2 added to bio
          no more pages
      			write()
      			locks page 1
      			dirties page 1
      			locks page 2
      			dirties page 2
      			fsync()
      			....
      			xfs_vm_writepages
      			write_cache_pages
      			  start index 0
      			  find page 1 towrite
      			  lock Page 1
      			  ->writepage
      			    page 1 writeback
      			    page 1 clean
      			    page 1 added to bio
      			  find page 2 towrite
      			  lock Page 2
      			  page 2 is writeback
      			  <blocks>
      						write()
      						locks page 1
      						dirties page 1
      						fsync()
      						....
      						xfs_vm_writepages
      						write_cache_pages
      						  start index 0
      
          !done && !cycled
            sets index to 0, restarts lookup
          find page 1 dirty
      						  find page 1 towrite
      						  lock Page 1
      						  page 1 is writeback
      						  <blocks>
      
          lock Page 1
          <blocks>
      
      DEADLOCK because:
      
      	- process 1 needs page 2 writeback to complete to make
      	  enough progress to issue IO pending for page 1
      	- writeback needs page 1 writeback to complete so process 2
      	  can progress and unlock the page it is blocked on, then it
      	  can issue the IO pending for page 2
      	- process 2 can't make progress until process 1 issues IO
      	  for page 1
      
      The underlying cause of the problem here is that range_cyclic writeback is
      processing pages in descending index order as we hold higher index pages
      in a structure controlled from above write_cache_pages().  The
      write_cache_pages() caller needs to be able to submit these pages for IO
      before write_cache_pages restarts writeback at mapping index 0 to avoid
      wcp inverting the page lock/writeback wait order.
      
      generic_writepages() is not susceptible to this bug as it has no private
      context held across write_cache_pages() - filesystems using this
      infrastructure always submit pages in ->writepage immediately and so there
      is no problem with range_cyclic going back to mapping index 0.
      
      However:
      	mpage_writepages() has a private bio context,
      	exofs_writepages() has page_collect
      	fuse_writepages() has fuse_fill_wb_data
      	nfs_writepages() has nfs_pageio_descriptor
      	xfs_vm_writepages() has xfs_writepage_ctx
      
      All of these ->writepages implementations can hold pages under writeback
      in their private structures until write_cache_pages() returns, and hence
      they are all susceptible to this deadlock.
      
      Also worth noting is that ext4 has its own bastardised version of
      write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
      at the code long enough to understand that it has a similar retry loop for
      range_cyclic writeback reaching the end of the file and then promptly ran
      away before my eyes bled too much.  I'll leave it for the ext4 developers
      to determine if their code actually has this deadlock and how to fix it
      if it has.
      
      There are a few ways I can see to avoid this deadlock.  There are probably
      more, but these are the first I've thought of:
      
      1. get rid of range_cyclic altogether
      
      2. range_cyclic always stops at EOF, and we start again from
      writeback index 0 on the next call into write_cache_pages()
      
      2a. wcp also returns EAGAIN to ->writepages implementations to
      indicate range_cyclic has hit EOF. writepages implementations can
      then flush the current context and call wcp again to continue. i.e.
      lift the retry into the ->writepages implementation
      
      3. range_cyclic uses trylock_page() rather than lock_page(), and it
      skips pages it can't lock without blocking. It will already do this
      for pages under writeback, so this seems like a no-brainer
      
      3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
      blocking as per pages under writeback.
      
      I don't think #1 is an option - range_cyclic prevents frequently
      dirtied lower file offsets from starving background writeback of
      rarely touched higher file offsets.
      
      #2 is simple, and I don't think it will have any impact on
      performance as going back to the start of the file implies an
      immediate seek. We'll have exactly the same number of seeks if we
      switch writeback to another inode, and then come back to this one
      later and restart from index 0.
      
      #2a is pretty much "status quo without the deadlock". Moving the
      retry loop up into the wcp caller means we can issue IO on the
      pending pages before calling wcp again, and so avoid locking or
      waiting on pages in the wrong order. I'm not convinced we need to do
      this given that we get the same thing from #2 on the next writeback
      call from the writeback infrastructure.
      
      #3 is really just a band-aid - it doesn't fix the access/wait
      inversion problem, just prevents it from becoming a deadlock
      situation. I'd prefer we fix the inversion, not sweep it under the
      carpet like this.
      
      #3a is really an optimisation that just so happens to include the
      band-aid fix of #3.
      
      So it seems that the simplest way to fix this issue is to implement
      solution #2.
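
      Option #2 can be sketched as a small user-space model. This is not the
      kernel's write_cache_pages(); struct toy_mapping and
      toy_write_cache_pages() are hypothetical names used only for
      illustration. The point it shows is that a single pass only walks
      ascending indexes and leaves the restart from index 0 to the next call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a mapping; not a kernel structure. */
struct toy_mapping {
	bool dirty[8];            /* dirty[i] is true if page i needs writeback */
	size_t nr_pages;          /* number of pages in the file (<= 8) */
	size_t writeback_index;   /* where the next cyclic pass should start */
};

/*
 * One range_cyclic pass under option #2: scan from writeback_index to EOF
 * and return. There is no wrap to index 0 inside the pass, so pages are
 * only visited in ascending index order while the caller may still hold
 * earlier pages in its private context. Returns the number of pages
 * "written".
 */
static size_t toy_write_cache_pages(struct toy_mapping *m)
{
	size_t written = 0;
	size_t i;

	assert(m->nr_pages <= 8);
	for (i = m->writeback_index; i < m->nr_pages; i++) {
		if (!m->dirty[i])
			continue;
		m->dirty[i] = false;
		written++;
	}
	/* Hit EOF: the next call, not this one, restarts from index 0. */
	m->writeback_index = 0;
	return written;
}
```

      A caller that keeps pages in a private context would submit them
      between two calls, so the second call starts at index 0 with nothing
      pending.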
      
      Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 21 Oct, 2018 1 commit
  3. 30 Aug, 2018 1 commit
  4. 17 Aug, 2018 1 commit
  5. 21 Apr, 2018 1 commit
    • writeback: safer lock nesting · 2e898e4c
      Greg Thelen authored
      lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
      the page's memcg is undergoing move accounting, which occurs when a
      process leaves its memcg for a new one that has
      memory.move_charge_at_immigrate set.
      
      unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
      the given inode is switching writeback domains.  Switches occur when
      enough writes are issued from a new domain.
      
      This existing pattern is thus suspicious:
          lock_page_memcg(page);
          unlocked_inode_to_wb_begin(inode, &locked);
          ...
          unlocked_inode_to_wb_end(inode, locked);
          unlock_page_memcg(page);
      
      If inode switch and process memcg migration are both in-flight then
      unlocked_inode_to_wb_end() will unconditionally enable interrupts while
      still holding the lock_page_memcg() irq spinlock.  This suggests the
      possibility of deadlock if an interrupt occurs before unlock_page_memcg().
      
          truncate
          __cancel_dirty_page
          lock_page_memcg
          unlocked_inode_to_wb_begin
          unlocked_inode_to_wb_end
          <interrupts mistakenly enabled>
                                          <interrupt>
                                          end_page_writeback
                                          test_clear_page_writeback
                                          lock_page_memcg
                                          <deadlock>
          unlock_page_memcg
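
      The nesting problem can be modelled in user space with a single
      boolean standing in for the CPU's interrupt state. All toy_* names are
      hypothetical and no real locking happens here. The sketch only shows
      why an unconditional enable inside a save/restore region leaks enabled
      interrupts, and that one possible remedy (an assumption, not
      necessarily what the patch does) is for the inner pair to save and
      restore the state as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated CPU interrupt state; purely a user-space model. */
static bool irqs_enabled = true;

/* Models spin_lock_irqsave(): remember the previous state, then disable. */
static bool toy_lock_irqsave(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Models spin_unlock_irqrestore(): put back whatever the caller saw. */
static void toy_unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;
}

/* Models spin_lock_irq()/spin_unlock_irq(): unconditional disable/enable. */
static void toy_lock_irq(void)   { irqs_enabled = false; }
static void toy_unlock_irq(void) { irqs_enabled = true; }

/* Suspicious nesting: reports whether IRQs are on while the outer lock is held. */
static bool toy_nest_unsafe(void)
{
	bool outer = toy_lock_irqsave();   /* like lock_page_memcg() */
	bool leaked;

	toy_lock_irq();                    /* like unlocked_inode_to_wb_begin() */
	toy_unlock_irq();                  /* like unlocked_inode_to_wb_end() */
	leaked = irqs_enabled;             /* outer lock is still held here */
	toy_unlock_irqrestore(outer);
	return leaked;
}

/* Safer nesting: the inner pair restores the state it found. */
static bool toy_nest_safe(void)
{
	bool outer = toy_lock_irqsave();
	bool inner = toy_lock_irqsave();
	bool leaked;

	toy_unlock_irqrestore(inner);
	leaked = irqs_enabled;
	toy_unlock_irqrestore(outer);
	return leaked;
}
```

      In this model toy_nest_unsafe() reports interrupts enabled inside the
      outer critical section, which is the window the trace above depends on.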
      
      Due to configuration limitations this deadlock is not currently possible
      because we don't mix cgroup writeback (a cgroupv2 feature) and
      memory.move_charge_at_immigrate (a cgroupv1 feature).
      
      If the kernel is hacked to always claim inode switching and memcg
      moving_account, then this script triggers lockup in less than a minute:
      
        cd /mnt/cgroup/memory
        mkdir a b
        echo 1 > a/memory.move_charge_at_immigrate
        echo 1 > b/memory.move_charge_at_immigrate
        (
          echo $BASHPID > a/cgroup.procs
          while true; do
            dd if=/dev/zero of=/mnt/big bs=1M count=256
          done
        ) &
        while true; do
          sync
        done &
        sleep 1h &
        SLEEP=$!
        while true; do
          echo $SLEEP > a/cgroup.procs
          echo $SLEEP > b/cgroup.procs
        done
      
      The deadlock does not seem possible, so it's debatable if there's any
      reason to modify the kernel.  I suggest we should, to prevent future
      surprises.  And Wang Long said "this deadlock occurs three times in our
      environment", so there's more reason to apply this, even to stable.
      Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
      see "[PATCH for-4.4] writeback: safer lock nesting"
      https://lkml.org/lkml/2018/4/11/146
      
      [gthelen@google.com: v4]
        Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
      [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
      Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
      Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
      Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reported-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: <stable@vger.kernel.org>	[v4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 11 Apr, 2018 1 commit
  7. 30 Nov, 2017 1 commit
    • Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" · 90daf306
      Michal Hocko authored
      This reverts commit 0f6d24f8 ("mm/page-writeback.c: print a warning
      if the vm dirtiness settings are illogical") because it causes false
      positive warnings during OOM situations as noticed by Tetsuo Handa:
      
        Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
        Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 2687 3645 3645
        Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 0 958 958
        Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        [...]
        Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child
        Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        vm direct limit must be set greater than background limit.
      
      The problem is that both thresh and bg_thresh will be 0 if
      available_memory is less than 4 pages when evaluating
      global_dirtyable_memory.
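
      The truncation can be illustrated with a deliberately simplified
      model. The kernel's real computation differs in detail; toy_limit()
      and toy_limits_look_illogical() are hypothetical helpers, and the
      "background >= direct" test is an assumption about the reverted
      check. The sketch only shows that integer percentages of a tiny page
      count both round down to 0, so the two limits compare equal even when
      the configured ratios are sane.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified: limit in pages as an integer percentage of available pages. */
static unsigned long toy_limit(unsigned long available_pages, unsigned int pct)
{
	return available_pages * pct / 100;
}

/* Assumed warning condition in this model: background limit >= direct limit. */
static bool toy_limits_look_illogical(unsigned long available_pages,
				      unsigned int dirty_pct,
				      unsigned int background_pct)
{
	unsigned long thresh = toy_limit(available_pages, dirty_pct);
	unsigned long bg_thresh = toy_limit(available_pages, background_pct);

	return bg_thresh >= thresh;
}
```

      With 3 available pages and ratios of 20 and 10, both limits are 0 and
      the model flags the settings; with plenty of memory the same ratios
      pass.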
      
      While this might be worked around, the whole point of the warning is
      dubious at best.  We do rely on admins to do sensible things when
      changing tunable knobs.  Dirty memory writeback knobs are not special
      in that regard, so revert the warning rather than adding more
      hacks to work around this.
      
      Debugged by Yafang Shao.
      
      Link: http://lkml.kernel.org/r/20171127091939.tahb77nznytcxw55@dhcp22.suse.cz
      Fixes: 0f6d24f8 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 21 Nov, 2017 1 commit
    • block/laptop_mode: Convert timers to use timer_setup() · bca237a5
      Kees Cook authored
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new timer_setup() and from_timer()
      to pass the timer pointer explicitly.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
  9. 16 Nov, 2017 8 commits
  10. 14 Oct, 2017 1 commit
  11. 09 Oct, 2017 1 commit
  12. 03 Oct, 2017 2 commits
  13. 07 Sep, 2017 1 commit
  14. 18 Aug, 2017 1 commit
    • mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Johannes Weiner authored
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but was less
      likely to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
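
      The ordering of the fix can be sketched in user space. The toy_*
      types and functions are hypothetical, and the actual pinning of the
      memcg is omitted. toy_racing_free() stands in for a truncate or
      migration that clears the pointer as soon as the writeback flag is
      gone; the sketch shows that a pointer saved before the flag is cleared
      stays usable for the stat update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_memcg { long writeback_pages; };

struct toy_page {
	struct toy_memcg *memcg;  /* may be cleared once writeback ends */
	bool writeback;
};

/* Stands in for a racing truncate/free that runs once the flag is clear. */
static void toy_racing_free(struct toy_page *page)
{
	page->memcg = NULL;
}

/* Returns true if the writeback flag was set. */
static bool toy_test_clear_writeback(struct toy_page *page)
{
	/* Look up the memcg while the writeback flag still keeps the page alive. */
	struct toy_memcg *memcg = page->memcg;
	bool was_set = page->writeback;

	page->writeback = false;
	toy_racing_free(page);        /* worst case: page is gone right away */

	/* Use only the saved pointer from here on, never page->memcg. */
	if (was_set && memcg)
		memcg->writeback_pages--;
	return was_set;
}
```

      Reading page->memcg after the flag is cleared would dereference NULL
      in this model, which mirrors the reported crash.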
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Reported-by: Bradley Bolen <bradleybolen@gmail.com>
      Tested-by: Brad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 12 Jul, 2017 1 commit
  16. 06 Jul, 2017 1 commit
  17. 05 Jul, 2017 2 commits
  18. 03 May, 2017 3 commits
  19. 28 Apr, 2017 1 commit
    • mm: retry writepages() on ENOMEM when doing an data integrity writeback · 80a2ea9f
      Theodore Ts'o authored
      Currently, a file system's writepages() function must not fail with
      ENOMEM, since if it does, it's possible for buffered data to be lost.
      This is because on a data integrity writeback writepages() gets called
      but once, and if it returns ENOMEM, if you're lucky the error will get
      reflected back to the userspace process calling fsync().  If you
      aren't lucky, the user is unmounting the file system, and the dirty
      pages will simply be lost.
      
      For this reason, file system code generally will use GFP_NOFS, and in
      some cases, will retry the allocation in a loop, on the theory that
      "kernel livelocks are temporary; data loss is forever".
      Unfortunately, this can indeed cause livelocks, since inside the
      writepages() call, the file system is holding various mutexes, and
      these mutexes may prevent the OOM killer from killing its targeted
      victim if it is also holding on to those mutexes.
      
      A better solution would be to allow writepages() to call the memory
      allocator with flags that give greater latitude to the allocator to
      fail, and then release its locks and return ENOMEM, and in the case of
      background writeback, the writes can be retried at a later time.  In
      the case of data-integrity writeback, retry after waiting a brief
      amount of time.
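
      The retry policy described above can be sketched as a user-space
      loop. The callback types, toy_do_writepages() and the demo helpers are
      hypothetical; a real caller would sleep briefly where the back-off
      hook is invoked. Only data-integrity writeback loops on ENOMEM, while
      background writeback returns the error and relies on a later pass.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callback type standing in for a filesystem's writepages(). */
typedef int (*toy_writepages_fn)(void *ctx);

/* Hypothetical back-off hook; a real caller would sleep briefly here. */
typedef void (*toy_backoff_fn)(void *ctx);

static int toy_do_writepages(toy_writepages_fn writepages,
			     toy_backoff_fn backoff, void *ctx,
			     bool data_integrity)
{
	int ret;

	for (;;) {
		ret = writepages(ctx);
		/* Background writeback just reports ENOMEM; it is retried later. */
		if (ret != -ENOMEM || !data_integrity)
			break;
		backoff(ctx);
	}
	return ret;
}

/* Demo helpers: fail with ENOMEM a fixed number of times, then succeed. */
struct toy_state { int failures_left; int backoffs; };

static int toy_flaky_writepages(void *ctx)
{
	struct toy_state *s = ctx;

	if (s->failures_left > 0) {
		s->failures_left--;
		return -ENOMEM;
	}
	return 0;
}

static void toy_count_backoff(void *ctx)
{
	struct toy_state *s = ctx;

	s->backoffs++;
}
```

      With two simulated failures, a data-integrity call backs off twice
      and then succeeds, while a background call returns -ENOMEM at once.
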
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  20. 02 Mar, 2017 1 commit
  21. 28 Feb, 2017 1 commit
  22. 25 Feb, 2017 1 commit
  23. 02 Feb, 2017 1 commit
  24. 15 Dec, 2016 1 commit
  25. 08 Nov, 2016 1 commit
  26. 08 Oct, 2016 2 commits
    • mm: don't use radix tree writeback tags for pages in swap cache · 371a096e
      Huang Ying authored
      File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
      etc.) to accelerate finding the pages with a specific tag in the radix
      tree during inode writeback.  But for anonymous pages in the swap cache,
      there is no inode writeback.  So there is no need to find the pages with
      some writeback tags in the radix tree.  It is not necessary to touch
      radix tree writeback tags for pages in the swap cache.
      
      Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
      introduced for address spaces which don't need to update the writeback
      tags.  The flag is set for swap caches.  It may be used for DAX file
      systems, etc.
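
      A per-mapping opt-out flag of this kind can be sketched as follows.
      This is a user-space model: the TOY_ prefix marks every name as
      hypothetical, the bit position is arbitrary, and tag_updates merely
      counts where a tagged tree would have been modified under its lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit position is an arbitrary choice for this model. */
#define TOY_AS_NO_WRITEBACK_TAGS (1UL << 0)

struct toy_address_space {
	unsigned long flags;
	unsigned long tag_updates;   /* counts simulated radix tree tag writes */
};

static bool toy_mapping_use_writeback_tags(const struct toy_address_space *m)
{
	return !(m->flags & TOY_AS_NO_WRITEBACK_TAGS);
}

/* Start writeback on one page; touch the tag only if the mapping wants it. */
static void toy_set_page_writeback(struct toy_address_space *m)
{
	if (toy_mapping_use_writeback_tags(m))
		m->tag_updates++;   /* the tree lock would be taken here */
}
```

      A mapping with the flag set, such as a swap cache in this model,
      never reaches the tag update and so never contends on the tree lock
      for it.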
      
      With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
      ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
      The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
      simulated PMEM (persistent memory) device.  The improvement comes from
      the reduced contention on the swap cache radix tree lock.  To test
      sequential swapping out, the test case uses 8 processes, which
      sequentially allocate and write to the anonymous pages until RAM and
      part of the swap device is used up.
      
      Details of the comparison are as follows:
      
      base             base+patch
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
         2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
         1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
           10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
           10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
           10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
           10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
      
      Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Michal Hocko authored
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during the reclaim.  Too many pages could be
      put under writeback therefore LRUs would be full of unreclaimable pages
      until the IO completes and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced since then in the
      reclaim path though.  Writers are throttled by balance_dirty_pages when
      initiating the buffered IO and later during the memory pressure, the
      direct reclaim is throttled by wait_iff_congested if the node is
      considered congested by dirty pages on LRUs and the underlying bdi is
      congested by the queued IO.  The kswapd is throttled as well if it
      encounters pages marked for immediate reclaim or under writeback which
      signals that there are too many pages under writeback already.
      Finally should_reclaim_retry does congestion_wait if the reclaim cannot
      make any progress and there are too many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  In a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus inefficient, which would
      just make the problem worse.
      
      These three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure so yet another
      throttling point doesn't really seem helpful.  Quite the contrary, Mikulas
      Patocka has reported that swap backed by dm-crypt doesn't work properly
      because the swapout IO cannot make sufficient progress as the writeout
      path depends on dm_crypt worker which has to allocate memory to perform
      the encryption.  In order to guarantee a forward progress it relies on
      the mempool allocator.  mempool_alloc(), however, prefers to use the
      underlying (usually page) allocator before it grabs objects from the
      pool.  Such an allocation can dive into the memory reclaim and
      consequently into throttle_vm_writeout.  If there are too many dirty
      pages or pages under writeback it will get throttled even though it is
      in fact a flusher to clear pending pages.
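
      The allocation order described here can be sketched with a toy pool.
      This is a user-space model with hypothetical names, not the kernel's
      mempool API. It shows only the ordering that matters for the report:
      the backing allocator is tried first, and the preallocated reserve is
      used only once that attempt has failed.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_MIN 2

typedef void *(*toy_alloc_fn)(void);

struct toy_mempool {
	void *reserve[TOY_POOL_MIN];  /* preallocated elements */
	int nr_reserved;
	toy_alloc_fn backing_alloc;   /* e.g. a slab or page allocator */
};

static void *toy_mempool_alloc(struct toy_mempool *pool)
{
	/* First choice: the backing allocator, which may enter reclaim. */
	void *elem = pool->backing_alloc();

	if (elem)
		return elem;
	/* Fallback: take a preallocated element, if any remain. */
	if (pool->nr_reserved > 0)
		return pool->reserve[--pool->nr_reserved];
	return NULL;   /* this model gives up; a real pool may wait instead */
}

/* Demo backing allocator that always fails, as under memory pressure. */
static void *toy_failing_alloc(void) { return NULL; }
```

      Because the first attempt goes through the ordinary allocator, it can
      end up in reclaim and hit any throttling point placed there before
      the reserve is ever consulted.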
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very
      helpful anymore.
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which has introduced that [1].  Small
      virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
      rather slow NFS in a sync mode on the host) with 8 parallel writers each
      writing 1G worth of data.  As soon as the pagecache fills up and the
      direct reclaim hits then I start anon memory consumer in a loop
      (allocating 300M and exiting after populating it) in the background to
      make the memory pressure even stronger as well as to disrupt the steady
      state for the IO.  The direct reclaim is throttled because of the
      congestion as well as kswapd hitting congestion_wait due to nr_immediate
      but throttle_vm_writeout doesn't ever trigger the sleep throughout the
      test.  Dirty+writeback are close to nr_dirty_threshold with some
      fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 06 Sep, 2016 1 commit
  28. 28 Jul, 2016 1 commit