1. 22 Jun, 2018 1 commit
    • Jan Kara's avatar
      bdi: Fix another oops in wb_workfn() · 3ee7e869
      Jan Kara authored
      syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
      wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
      WB_shutting_down after wb->bdi->dev became NULL. This indicates that
      unregister_bdi() failed to call wb_shutdown() on one of wb objects.
      
      The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
      drops bdi's reference to wb structures before going through the list of
      wbs again and calling wb_shutdown() on each of them. This way the loop
      iterating through all wbs can easily miss a wb if that wb has already
      passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
      from cgwb_release_workfn() and as a result fully shutdown bdi although
      wb_workfn() for this wb structure is still running. In fact there are
      also other ways cgwb_bdi_unregister() can race with
      cgwb_release_workfn() leading e.g. to use-after-free issues:
      
      CPU1                            CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);
      
      cgwb_release()
        queue_work(cgwb_release_wq, &wb->release_work);
      cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
        wb_shutdown(wb);
        ...
        kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free
      
      We solve these issues by synchronizing writeback structure shutdown from
      cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
      way we also no longer need synchronization using WB_shutting_down as the
      mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
      CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
      bdi_unregister().
      Reported-by: 's avatarsyzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      3ee7e869
  2. 08 Jun, 2018 1 commit
  3. 23 May, 2018 1 commit
    • Tejun Heo's avatar
      bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue · f1834646
      Tejun Heo authored
      From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
      From: Tejun Heo <tj@kernel.org>
      Date: Wed, 23 May 2018 10:29:00 -0700
      
      cgwb_release() punts the actual release to cgwb_release_workfn() on
      system_wq.  Depending on the number of cgroups or block devices, there
      can be a lot of cgwb_release_workfn() in flight at the same time.
      
      We're periodically seeing close to 256 kworkers getting stuck with the
      following stack trace and overtime the entire system gets stuck.
      
        [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
        [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
        [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
        [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
        [<ffffffff811cd645>] wb_exit+0x85/0xa0
        [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
        [<ffffffff810a68d0>] process_one_work+0x150/0x410
        [<ffffffff810a71fd>] worker_thread+0x6d/0x520
        [<ffffffff810ad3dc>] kthread+0x12c/0x160
        [<ffffffff81969019>] ret_from_fork+0x29/0x40
        [<ffffffffffffffff>] 0xffffffffffffffff
      
      The events leading to the lockup are...
      
      1. A lot of cgwb_release_workfn() is queued at the same time and all
         system_wq kworkers are assigned to execute them.
      
      2. They all end up calling synchronize_rcu_expedited().  One of them
         wins and tries to perform the expedited synchronization.
      
      3. However, that invovles queueing rcu_exp_work to system_wq and
         waiting for it.  Because #1 is holding all available kworkers on
         system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
         is waiting for synchronize_rcu_expedited() which in turn is waiting
         for cgwb_release_workfn() to free up some of the kworkers.
      
      We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
      same time.  There's nothing to be gained from that.  This patch
      updates cgwb release path to use a dedicated percpu workqueue with
      @max_active of 1.
      
      While this resolves the problem at hand, it might be a good idea to
      isolate rcu_exp_work to its own workqueue too as it can be used from
      various paths and is prone to this sort of indirect A-A deadlocks.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      f1834646
  4. 03 May, 2018 2 commits
  5. 11 Apr, 2018 1 commit
    • Andrey Ryabinin's avatar
      mm/vmscan: don't mess with pgdat->flags in memcg reclaim · e3c1ac58
      Andrey Ryabinin authored
      memcg reclaim may alter pgdat->flags based on the state of LRU lists in
      cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep
      congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
      pages.  But the worst here is PGDAT_CONGESTED, since it may force all
      direct reclaims to stall in wait_iff_congested().  Note that only kswapd
      have powers to clear any of these bits.  This might just never happen if
      cgroup limits configured that way.  So all direct reclaims will stall as
      long as we have some congested bdi in the system.
      
      Leave all pgdat->flags manipulations to kswapd.  kswapd scans the whole
      pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
      it's reasonable to leave all decisions about node state to kswapd.
      
      Why only kswapd? Why not allow to global direct reclaim change these
      flags? It is because currently only kswapd can clear these flags.  I'm
      less worried about the case when PGDAT_CONGESTED falsely not set, and
      more worried about the case when it falsely set.  If direct reclaimer
      sets PGDAT_CONGESTED, do we have guarantee that after the congestion
      problem is sorted out, kswapd will be woken up and clear the flag? It
      seems like there is no such guarantee.  E.g.  direct reclaimers may
      eventually balance pgdat and kswapd simply won't wake up (see
      wakeup_kswapd()).
      
      Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim
      now loses its congestion throttling mechanism.  Add per-cgroup
      congestion state and throttle cgroup2 reclaimers if memcg is in
      congestion state.
      
      Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
      bits since they alter only kswapd behavior.
      
      The problem could be easily demonstrated by creating heavy congestion in
      one cgroup:
      
          echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
          mkdir -p /sys/fs/cgroup/congester
          echo 512M > /sys/fs/cgroup/congester/memory.max
          echo $$ > /sys/fs/cgroup/congester/cgroup.procs
          /* generate a lot of diry data on slow HDD */
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
          ....
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
      
      and some job in another cgroup:
      
          mkdir /sys/fs/cgroup/victim
          echo 128M > /sys/fs/cgroup/victim/memory.max
      
          # time cat /dev/sda > /dev/null
          real    10m15.054s
          user    0m0.487s
          sys     1m8.505s
      
      According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
      of the time sleeping there.
      
      With the patch, cat don't waste time anymore:
      
          # time cat /dev/sda > /dev/null
          real    5m32.911s
          user    0m0.411s
          sys     0m56.664s
      
      [aryabinin@virtuozzo.com: congestion state should be per-node]
        Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
      [ayabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup[
        Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.comSigned-off-by: 's avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: 's avatarShakeel Butt <shakeelb@google.com>
      Acked-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3c1ac58
  6. 06 Apr, 2018 1 commit
  7. 28 Feb, 2018 1 commit
  8. 21 Dec, 2017 1 commit
  9. 19 Nov, 2017 2 commits
  10. 06 Oct, 2017 1 commit
  11. 11 Sep, 2017 1 commit
  12. 20 Apr, 2017 5 commits
  13. 23 Mar, 2017 6 commits
    • Jan Kara's avatar
      bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() · b1c51afc
      Jan Kara authored
      Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
      from bdi_unregister() which is not necessarily called from bdi_destroy()
      and thus the name is somewhat misleading.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      b1c51afc
    • Jan Kara's avatar
      bdi: Do not wait for cgwbs release in bdi_unregister() · 4514451e
      Jan Kara authored
      Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
      (called from bdi_unregister()). That is however unnecessary now when
      cgwb->bdi is a proper refcounted reference (thus bdi cannot get
      released before all cgwbs are released) and when cgwb_bdi_destroy()
      shuts down writeback directly.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      4514451e
    • Jan Kara's avatar
      bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() · 5318ce7d
      Jan Kara authored
      Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
      which also means that writeback has been shutdown on them. Since this
      wait is going away, directly shutdown writeback on cgwbs from
      cgwb_bdi_destroy() to avoid live writeback structures after
      bdi_unregister() has finished. To make that safe with concurrent
      shutdown from cgwb_release_workfn(), we also have to make sure
      wb_shutdown() returns only after the bdi_writeback structure is really
      shutdown.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      5318ce7d
    • Jan Kara's avatar
      bdi: Unify bdi->wb_list handling for root wb_writeback · e8cb72b3
      Jan Kara authored
      Currently root wb_writeback structure is added to bdi->wb_list in
      bdi_init() and never removed. That is different from all other
      wb_writeback structures which get added to the list when created and
      removed from it before wb_shutdown().
      
      So move list addition of root bdi_writeback to bdi_register() and list
      removal of all wb_writeback structures to wb_shutdown(). That way a
      wb_writeback structure is on bdi->wb_list if and only if it can handle
      writeback and it will make it easier for us to handle shutdown of all
      wb_writeback structures in bdi_unregister().
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      e8cb72b3
    • Jan Kara's avatar
      bdi: Make wb->bdi a proper reference · 810df54a
      Jan Kara authored
      Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
      structures except for the one embedded inside struct backing_dev_info.
      That will allow us to simplify bdi unregistration.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      810df54a
    • Jan Kara's avatar
      bdi: Mark congested->bdi as internal · b7d680d7
      Jan Kara authored
      congested->bdi pointer is used only to be able to remove congested
      structure from bdi->cgwb_congested_tree on structure release. Moreover
      the pointer can become NULL when we unregister the bdi. Rename the field
      to __bdi and add a comment to make it more explicit this is internal
      stuff of memcg writeback code and people should not use the field as
      such use will be likely race prone.
      
      We do not bother with converting congested->bdi to a proper refcounted
      reference. It will be slightly ugly to special-case bdi->wb.congested to
      avoid effectively a cyclic reference of bdi to itself and the reference
      gets cleared from bdi_unregister() making it impossible to reference
      a freed bdi.
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      b7d680d7
  14. 08 Mar, 2017 2 commits
    • Jan Kara's avatar
      bdi: Fix use-after-free in wb_congested_put() · df23de55
      Jan Kara authored
      bdi_writeback_congested structures get created for each blkcg and bdi
      regardless whether bdi is registered or not. When they are created in
      unregistered bdi and the request queue (and thus bdi) is then destroyed
      while blkg still holds reference to bdi_writeback_congested structure,
      this structure will be referencing freed bdi and last wb_congested_put()
      will try to remove the structure from already freed bdi.
      
      With commit 165a5e22 "block: Move bdi_unregister() to
      del_gendisk()", SCSI started to destroy bdis without calling
      bdi_unregister() first (previously it was calling bdi_unregister() even
      for unregistered bdis) and thus the code detaching
      bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
      started hitting this use-after-free bug. It is enough to boot a KVM
      instance with virtio-scsi device to trigger this behavior.
      
      Fix the problem by detaching bdi_writeback_congested structures in
      bdi_exit() instead of bdi_unregister(). This is also more logical as
      they can get attached to bdi regardless whether it ever got registered
      or not.
      
      Fixes: 165a5e22Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Tested-by: 's avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      df23de55
    • Jan Kara's avatar
      block: Allow bdi re-registration · b6f8fec4
      Jan Kara authored
      SCSI can call device_add_disk() several times for one request queue when
      a device in unbound and bound, creating new gendisk each time. This will
      lead to bdi being repeatedly registered and unregistered. This was not a
      big problem until commit 165a5e22 "block: Move bdi_unregister() to
      del_gendisk()" since bdi was only registered repeatedly (bdi_register()
      handles repeated calls fine, only we ended up leaking reference to
      gendisk due to overwriting bdi->owner) but unregistered only in
      blk_cleanup_queue() which didn't get called repeatedly. After
      165a5e22 we were doing correct bdi_register() - bdi_unregister()
      cycles however bdi_unregister() is not prepared for it. So make sure
      bdi_unregister() cleans up bdi in such a way that it is prepared for
      a possible following bdi_register() call.
      
      An easy way to provoke this behavior is to enable
      CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
      scsi disk which immediately hangs without this fix.
      
      Fixes: 165a5e22Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Tested-by: 's avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      b6f8fec4
  15. 23 Feb, 2017 1 commit
  16. 08 Feb, 2017 1 commit
    • Tejun Heo's avatar
      block: fix double-free in the failure path of cgwb_bdi_init() · 5f478e4e
      Tejun Heo authored
      When !CONFIG_CGROUP_WRITEBACK, bdi has single bdi_writeback_congested
      at bdi->wb_congested.  cgwb_bdi_init() allocates it with kzalloc() and
      doesn't do further initialization.  This usually works fine as the
      reference count gets bumped to 1 by wb_init() and the put from
      wb_exit() releases it.
      
      However, when wb_init() fails, it puts the wb base ref automatically
      freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
      ends up trying to free the same pointer the second time causing a
      double-free.
      
      Fix it by explicitly initilizing the refcnt to 1 and putting the base
      ref from cgwb_bdi_destroy().
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Reported-by: 's avatarDmitry Vyukov <dvyukov@google.com>
      Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      5f478e4e
  17. 02 Feb, 2017 1 commit
  18. 08 Nov, 2016 1 commit
  19. 04 Aug, 2016 1 commit
    • Dan Williams's avatar
      block: fix bdi vs gendisk lifetime mismatch · df08c32c
      Dan Williams authored
      The name for a bdi of a gendisk is derived from the gendisk's devt.
      However, since the gendisk is destroyed before the bdi it leaves a
      window where a new gendisk could dynamically reuse the same devt while a
      bdi with the same name is still live.  Arrange for the bdi to hold a
      reference against its "owner" disk device while it is registered.
      Otherwise we can hit sysfs duplicate name collisions like the following:
      
       WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
       sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'
      
       Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
        0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
        ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
        0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
       Call Trace:
        [<ffffffff8134caec>] dump_stack+0x63/0x87
        [<ffffffff8108c351>] __warn+0xd1/0xf0
        [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
        [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
        [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
        [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
        [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
        [<ffffffff8134ff55>] kobject_add+0x75/0xd0
        [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
        [<ffffffff8148b0a5>] device_add+0x125/0x610
        [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
        [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
        [<ffffffff811b775c>] bdi_register+0x8c/0x180
        [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
        [<ffffffff813317f5>] add_disk+0x175/0x4a0
      
      Cc: <stable@vger.kernel.org>
      Reported-by: 's avatarYi Zhang <yizhan@redhat.com>
      Tested-by: 's avatarYi Zhang <yizhan@redhat.com>
      Signed-off-by: 's avatarDan Williams <dan.j.williams@intel.com>
      
      Fixed up missing 0 return in bdi_register_owner().
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      df08c32c
  20. 28 Jul, 2016 1 commit
    • Mel Gorman's avatar
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman authored
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: 's avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      599d0c95
  21. 21 May, 2016 1 commit
    • Michal Hocko's avatar
      mm: throttle on IO only when there are too many dirty and writeback pages · ede37713
      Michal Hocko authored
      wait_iff_congested has been used to throttle allocator before it retried
      another round of direct reclaim to allow the writeback to make some
      progress and prevent reclaim from looping over dirty/writeback pages
      without making any progress.
      
      We used to do congestion_wait before commit 0e093d99 ("writeback: do
      not sleep on the congestion queue if there are no congested BDIs or if
      significant congestion is not being encountered in the current zone")
      but that led to undesirable stalls and sleeping for the full timeout
      even when the BDI wasn't congested.  Hence wait_iff_congested was used
      instead.
      
      But it seems that even wait_iff_congested doesn't work as expected.  We
      might have a small file LRU list with all pages dirty/writeback and yet
      the bdi is not congested so this is just a cond_resched in the end and
      can end up triggering pre mature OOM.
      
      This patch replaces the unconditional wait_iff_congested by
      congestion_wait which is executed only if we _know_ that the last round
      of direct reclaim didn't make any progress and dirty+writeback pages are
      more than a half of the reclaimable pages on the zone which might be
      usable for our target allocation.  This shouldn't reintroduce stalls
      fixed by 0e093d99 because congestion_wait is called only when we are
      getting hopeless when sleeping is a better choice than OOM with many
      pages under IO.
      
      We have to preserve logic introduced by commit 373ccbe5 ("mm,
      vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
      progress") into the __alloc_pages_slowpath now that wait_iff_congested
      is not used anymore.  As the only remaining user of wait_iff_congested
      is shrink_inactive_list we can remove the WQ specific short sleep from
      wait_iff_congested because the sleep is needed to be done only once in
      the allocation retry cycle.
      
      [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
       Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      ede37713
  22. 31 Mar, 2016 1 commit
  23. 17 Mar, 2016 1 commit
  24. 12 Feb, 2016 1 commit
  25. 06 Feb, 2016 1 commit
    • Tetsuo Handa's avatar
      mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress · 564e81a5
      Tetsuo Handa authored
      Jan Stancek has reported that system occasionally hanging after "oom01"
      testcase from LTP triggers OOM.  Guessing from a result that there is a
      kworker thread doing memory allocation and the values between "Node 0
      Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not
      up-to-date for some reason.
      
      According to commit 373ccbe5 ("mm, vmstat: allow WQ concurrency to
      discover memory reclaim doesn't make any progress"), it meant to force
      the kworker thread to take a short sleep, but it by error used
      schedule_timeout(1).  We missed that schedule_timeout() in state
      TASK_RUNNING doesn't do anything.
      
      Fix it by using schedule_timeout_uninterruptible(1) which forces the
      kworker thread to take a short sleep in order to make sure that vmstat
      is up-to-date.
      
      Fixes: 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
      Signed-off-by: 's avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: 's avatarJan Stancek <jstancek@redhat.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      564e81a5
  26. 15 Jan, 2016 1 commit
  27. 12 Dec, 2015 1 commit
    • Michal Hocko's avatar
      mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 373ccbe5
      Michal Hocko authored
      Tetsuo Handa has reported that the system might basically livelock in
      OOM condition without triggering the OOM killer.
      
      The issue is caused by internal dependency of the direct reclaim on
      vmstat counter updates (via zone_reclaimable) which are performed from
      the workqueue context.  If all the current workers get assigned to an
      allocation request, though, they will be looping inside the allocator
      trying to reclaim memory but zone_reclaimable can see stalled numbers so
      it will consider a zone reclaimable even though it has been scanned way
      too much.  WQ concurrency logic will not consider this situation as a
      congested workqueue because it relies that worker would have to sleep in
      such a situation.  This also means that it doesn't try to spawn new
      workers or invoke the rescuer thread if the one is assigned to the
      queue.
      
      In order to fix this issue we need to do two things.  First we have to
      let wq concurrency code know that we are in trouble so we have to do a
      short sleep.  In order to prevent from issues handled by 0e093d99
      ("writeback: do not sleep on the congestion queue if there are no
      congested BDIs or if significant congestion is not being encountered in
      the current zone") we limit the sleep only to worker threads which are
      the ones of the interest anyway.
      
      The second thing to do is to create a dedicated workqueue for vmstat and
      mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
      have a spare worker thread for it.
      Signed-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reported-by: 's avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      373ccbe5
  28. 07 Nov, 2015 1 commit
    • Mel Gorman's avatar
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: 's avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc