1. 04 Sep, 2019 1 commit
  2. 31 Aug, 2019 6 commits
  3. 25 Aug, 2019 8 commits
    • mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y · 00fb24a4
      Andrey Ryabinin authored
      Code like this:
      
      	ptr = kmalloc(size, GFP_KERNEL);
      	page = virt_to_page(ptr);
      	offset = offset_in_page(ptr);
      	kfree(page_address(page) + offset);
      
      may produce false-positive invalid-free reports on the kernel with
      CONFIG_KASAN_SW_TAGS=y.
      
      In the example above we lose the original tag assigned to 'ptr', so
      kfree() gets a pointer with the 0xFF tag.  kfree() then sees that the
      0xFF tag differs from the tag in shadow memory and prints a false report.
      
      Instead of just comparing tags, do the following:
      
      1) Check that shadow doesn't contain KASAN_TAG_INVALID.  Otherwise it's
         a double-free and it doesn't matter what tag the pointer has.
      
      2) If pointer tag is different from 0xFF, make sure that tag in the
         shadow is the same as in the pointer.
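
      The two steps above can be sketched as a small standalone predicate.
      This is only an illustration, not the kernel's code: the helper name
      invalid_free_by_tag() is hypothetical, and the constant values 0xFF
      and 0xFE are assumed here for the match-all and invalid tags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for this sketch. */
#define KASAN_TAG_KERNEL  0xFF /* tag carried by untagged kernel pointers */
#define KASAN_TAG_INVALID 0xFE /* shadow value of inaccessible memory */

/* Hypothetical helper: returns true when the free should be reported. */
static bool invalid_free_by_tag(uint8_t ptr_tag, uint8_t shadow_tag)
{
	/* 1) Shadow says the object is already freed: double-free,
	 *    whatever tag the pointer carries. */
	if (shadow_tag == KASAN_TAG_INVALID)
		return true;

	/* 2) Only a pointer that still carries a real tag has to match. */
	if (ptr_tag != KASAN_TAG_KERNEL && ptr_tag != shadow_tag)
		return true;

	return false;
}
```

      With this shape, a pointer rebuilt through page_address() (tag 0xFF)
      is accepted as long as the shadow does not mark the object as freed.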
      
      Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com
      Fixes: 7f94ffbc ("kasan: add hooks implementation for tag-based mode")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc.c: fix race condition in zs_destroy_pool · 701d6785
      Henry Burns authored
      In zs_destroy_pool() we call flush_work(&pool->free_work).  However, we
      have no guarantee that migration isn't happening in the background at
      that time.
      
      Since migration can't directly free pages, it relies on free_work being
      scheduled to free the pages.  But there's nothing preventing an
      in-progress migrate from queuing the work *after*
      zs_unregister_migration() has called flush_work(), which would leave
      pages still pointing at the inode when we free it.
      
      Since we know at destroy time all objects should be free, no new
      migrations can come in (since zs_page_isolate() fails for fully-free
      zspages).  This means it is sufficient to track a "# isolated zspages"
      count by class, and have the destroy logic ensure all such pages have
      drained before proceeding.  Keeping that state under the class spinlock
      keeps the logic straightforward.
      
      In this case a memory leak could lead to an eventual crash if compaction
      hits the leaked page.  This crash would only occur if people are
      changing their zswap backend at runtime (which eventually starts
      destruction).
      
      Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com
      Fixes: 48b4800a ("zsmalloc: page migration support")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Jonathan Adams <jwadams@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely · 1a87aa03
      Henry Burns authored
      In zs_page_migrate() we call putback_zspage() after we have finished
      migrating all pages in this zspage.  However, the return value is
      ignored.  If a zs_free() races in between zs_page_isolate() and
      zs_page_migrate(), freeing the last object in the zspage,
      putback_zspage() will leave the page in ZS_EMPTY for potentially an
      unbounded amount of time.
      
      To fix this, we need to do the same thing as zs_page_putback() does:
      schedule free_work to occur.
      
      To avoid duplicated code, move the sequence to a new
      putback_zspage_deferred() function which both zs_page_migrate() and
      zs_page_putback() call.
      
      Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com
      Fixes: 48b4800a ("zsmalloc: page migration support")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Jonathan Adams <jwadams@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_owner: handle THP splits correctly · f7da677b
      Vlastimil Babka authored
      The THP splitting path is missing the split_page_owner() call that
      split_page() has.
      
      As a result, split THP pages are wrongly reported in the page_owner file
      as order-9 pages.  Furthermore when the former head page is freed, the
      remaining former tail pages are not listed in the page_owner file at
      all.  This patch fixes that by adding the split_page_owner() call into
      __split_huge_page().
      
      Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz
      Fixes: a9627bc5 ("mm/page_owner: introduce split_page_owner and replace manual handling")
      Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: flush percpu vmevents before releasing memcg · bb65f89b
      Roman Gushchin authored
      Similar to vmstats, percpu caching of local vmevents leads to an
      accumulation of errors on non-leaf levels.  This happens because some
      leftovers may remain in percpu caches, so that they are never propagated
      up the cgroup tree and simply disappear when the memory cgroup is
      released.
      
      To fix this issue let's accumulate and propagate percpu vmevents values
      before releasing the memory cgroup similar to what we're doing with
      vmstats.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: flush percpu vmstats before releasing memcg · c350a99e
      Roman Gushchin authored
      Percpu caching of local vmstats with the conditional propagation by the
      cgroup tree leads to an accumulation of errors on non-leaf levels.
      
      Let's imagine two nested memory cgroups A and A/B.  Say, a process
      belonging to A/B allocates 100 pagecache pages on the CPU 0.  The percpu
      cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
      and A atomic vmstat counters, 4 pages will remain in the percpu cache.
      
      Imagine A/B is near memory.max, so that every following allocation
      triggers a direct reclaim on the local CPU.  Say, each such attempt will
      free 16 pages on a new cpu.  That means every percpu cache will have -16
      pages, except the first one, which will have 4 - 16 = -12.  A/B and A
      atomic counters will not be touched at all.
      
      Now a user removes A/B.  All percpu caches are freed and corresponding
      vmstat numbers are forgotten.  A has 96 pages more than expected.
      
      As memory cgroups are created and destroyed, errors do accumulate.  Even
      differences of 1-2 pages can accumulate into large numbers.
      
      To fix this issue let's accumulate and propagate percpu vmstat values
      before releasing the memory cgroup.  At this point these numbers are
      stable and cannot be changed.
      
      Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
      only over online cpus.
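
      The A and A/B arithmetic above can be replayed with a small userspace
      model.  This is a sketch under stated assumptions, not the memcg
      implementation: struct memcg, mod_pages(), flush_percpu_pages() and
      replay_scenario() are hypothetical names, four CPUs are assumed, and
      the cache is made to spill as soon as it holds 32 pages so that the
      numbers match the 32*3=96 example.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4
#define BATCH   32 /* assumed spill threshold, matches the 32*3=96 example */

struct memcg {
	struct memcg *parent;
	long atomic_pages;          /* stands in for the atomic vmstat counter */
	long percpu_pages[NR_CPUS]; /* stands in for the percpu cache */
};

/* Add val on one CPU; spill up the tree once the cached delta reaches BATCH. */
static void mod_pages(struct memcg *cg, int cpu, long val)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	long x = cg->percpu_pages[cpu] + val;

	if (labs(x) >= BATCH) {
		for (struct memcg *p = cg; p; p = p->parent)
			p->atomic_pages += x;
		x = 0;
	}
	cg->percpu_pages[cpu] = x;
}

/* The fix: fold the leftovers into the group and its ancestors before release. */
static void flush_percpu_pages(struct memcg *cg)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		for (struct memcg *p = cg; p; p = p->parent)
			p->atomic_pages += cg->percpu_pages[cpu];
		cg->percpu_pages[cpu] = 0;
	}
}

/* Replays the scenario; returns A's counter at the point A/B is released. */
static long replay_scenario(int flush_before_release)
{
	struct memcg a = { 0 };
	struct memcg ab = { .parent = &a };

	for (int i = 0; i < 100; i++)           /* 100 pagecache pages on CPU 0 */
		mod_pages(&ab, 0, 1);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) /* reclaim frees 16 pages per CPU */
		mod_pages(&ab, cpu, -16);
	if (flush_before_release)
		flush_percpu_pages(&ab);
	return a.atomic_pages;
}
```

      Without the flush, A is left at 96 pages; with it, A ends at
      100 - 4*16 = 36, which is the number of pages still charged in this
      model.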
      
      Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
      Fixes: 42a30035 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: move_freepages should not examine struct page of reserved memory · cd961038
      David Rientjes authored
      After commit 907ec5fc ("mm: zero remaining unavailable struct
      pages"), struct page of reserved memory is zeroed.  This causes
      page->flags to be 0 and fixes issues related to reading
      /proc/kpageflags, for example, of reserved memory.
      
      The VM_BUG_ON() in move_freepages_block(), however, assumes that
      page_zone() is meaningful even for reserved memory.  That assumption is
      no longer true after the aforementioned commit.
      
      There's no reason why move_freepages_block() should be testing the
      legitimacy of page_zone() for reserved memory; its scope is limited only
      to pages on the zone's freelist.
      
      Note that pfn_valid() can be true for reserved memory: there is a
      backing struct page.  The check for page_to_nid(page) is also buggy but
      reserved memory normally only appears on node 0 so the zeroing doesn't
      affect this.
      
      Move the debug checks to after verifying PageBuddy is true.  This
      isolates the scope of the checks to only be for buddy pages which are on
      the zone's freelist which move_freepages_block() is operating on.  In
      this case, an incorrect node or zone is a bug worthy of being warned
      about (and the examination of struct page is acceptable because this
      memory is not reserved).
      
      Why does move_freepages_block() get called on reserved memory? It's
      simply math after finding a valid free page from the per-zone free area
      to use as fallback.  We find the beginning and end of the pageblock of
      the valid page and that can bring us into memory that was reserved per
      the e820.  pfn_valid() is still true (it's backed by a struct page), but
      since it's zero'd we shouldn't make any inferences here about comparing
      its node or zone.  The current node check just happens to succeed most
      of the time by luck because reserved memory typically appears on node 0.
      
      The fix here is to validate that we actually have buddy pages before
      testing if there's any type of zone or node strangeness going on.
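
      The ordering described above (test for a buddy page first, look at
      node and zone only afterwards) can be sketched in userspace as
      follows.  struct mock_page and count_misplaced_buddy_pages() are
      hypothetical stand-ins for illustration; the real code works on
      struct page and uses VM_BUG_ON-style checks rather than a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct page. */
struct mock_page {
	bool buddy; /* on the zone's freelist (PageBuddy) */
	int  nid;   /* node id; 0 in a zeroed struct page */
	int  zone;  /* zone index; 0 in a zeroed struct page */
};

/*
 * Count buddy pages in [start, start + n) whose node or zone is wrong.
 * Non-buddy pages, including zeroed reserved ones, are skipped before
 * their node and zone fields are ever read.
 */
static size_t count_misplaced_buddy_pages(const struct mock_page *start,
					  size_t n, int nid, int zone)
{
	size_t bad = 0;

	for (size_t i = 0; i < n; i++) {
		const struct mock_page *page = &start[i];

		if (!page->buddy)
			continue; /* reserved or in use: nothing to validate */
		if (page->nid != nid || page->zone != zone)
			bad++;    /* a debug build would complain here */
	}
	return bad;
}
```

      A zeroed reserved page in the middle of the pageblock no longer trips
      the check, while a buddy page on the wrong node or zone still does.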
      
      We noticed it almost immediately after bringing 907ec5fc in on
      CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
      the per-zone free area where the math in move_freepages() will bring the
      start or end pfn into reserved memory and wanting to claim that entire
      pageblock as a new migratetype.  So the path will be rare, require
      CONFIG_DEBUG_VM, and require fallback to a different migratetype.
      
      Some struct pages were already zeroed from reserve pages before
      907ec5fca3c so it theoretically could trigger before this commit.  I
      think it's rare enough under a config option that most people don't run
      that others may not have noticed.  I wouldn't argue against a stable tag
      and the backport should be easy enough, but probably wouldn't single out
      a commit that this is fixing.
      
      Mel said:
      
      : The overhead of the debugging check is higher with this patch although
      : it'll only affect debug builds and the path is not particularly hot.
      : If this was a concern, I think it would be reasonable to simply remove
      : the debugging check as the zone boundaries are checked in
      : move_freepages_block and we never expect a zone/node to be smaller than
      : a pageblock and stuck in the middle of another zone.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/z3fold.c: fix race between migration and destruction · d776aaa9
      Henry Burns authored
      In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq).
      However, we have no guarantee that migration isn't happening in the
      background at that time.
      
      Migration directly calls queue_work_on(pool->compact_wq); if destruction
      wins that race, we are using a destroyed workqueue.
      
      Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com
      Signed-off-by: Henry Burns <henryburns@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Jonathan Adams <jwadams@google.com>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 13 Aug, 2019 15 commits
    • hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS · 4643d67e
      Mike Kravetz authored
      Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
      the kernel-v5.2.3 testing.  This is caused by a race between hugetlb
      page migration and page fault.
      
      If a hugetlb page can not be allocated to satisfy a page fault, the task
      is sent SIGBUS.  This is normal hugetlbfs behavior.  A hugetlb fault
      mutex exists to prevent two tasks from trying to instantiate the same
      page.  This protects against the situation where there is only one
      hugetlb page, and both tasks would try to allocate.  Without the mutex,
      one would fail and SIGBUS even though the other fault would be
      successful.
      
      There is a similar race between hugetlb page migration and fault.
      Migration code will allocate a page for the target of the migration.  It
      will then unmap the original page from all page tables.  It does this
      unmap by first clearing the pte and then writing a migration entry.  The
      page table lock is held for the duration of this clear and write
      operation.  However, the beginning of the hugetlb page fault code
      optimistically checks the pte without taking the page table lock.  If
      clear (as it can be during the migration unmap operation), a hugetlb
      page allocation is attempted to satisfy the fault.  Note that the page
      which will eventually satisfy this fault was already allocated by the
      migration code.  However, the allocation within the fault path could
      fail which would result in the task incorrectly being sent SIGBUS.
      
      Ideally, we could take the hugetlb fault mutex in the migration code
      when modifying the page tables.  However, locks must be taken in the
      order of hugetlb fault mutex, page lock, page table lock.  This would
      require significant rework of the migration code.  Instead, the issue is
      addressed in the hugetlb fault code.  After failing to allocate a huge
      page, take the page table lock and check for huge_pte_none before
      returning an error.  This is the same check that must be made further in
      the code even if page allocation is successful.
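
      The recheck described above can be sketched in userspace as follows.
      This is an illustration only: struct mock_pte, enum fault_result and
      fault_no_page() are hypothetical, a pthread mutex stands in for the
      page table lock, and the outcome of the earlier (unlocked) allocation
      attempt is passed in as a flag.

```c
#include <pthread.h>
#include <stdbool.h>

enum fault_result { FAULT_RETRY, FAULT_SIGBUS, FAULT_MAPPED };

/* Hypothetical stand-in for one huge pte and its page table lock. */
struct mock_pte {
	pthread_mutex_t ptl;
	bool none; /* true while the entry is clear */
};

/* alloc_ok is the result of the allocation attempted before taking the lock. */
static enum fault_result fault_no_page(struct mock_pte *pte, bool alloc_ok)
{
	enum fault_result ret;

	pthread_mutex_lock(&pte->ptl);
	if (!pte->none) {
		/* The entry was filled meanwhile (e.g. a migration entry):
		 * do not report an error, let the fault be retried. */
		ret = FAULT_RETRY;
	} else if (!alloc_ok) {
		/* Still clear and no page available: a genuine failure. */
		ret = FAULT_SIGBUS;
	} else {
		pte->none = false; /* install the newly allocated page */
		ret = FAULT_MAPPED;
	}
	pthread_mutex_unlock(&pte->ptl);
	return ret;
}
```

      The point of the sketch is that SIGBUS is only reported when the
      entry is still clear under the lock, so a failed allocation that
      raced with migration turns into a retry instead.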
      
      Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Li Wang <liwang@redhat.com>
      Tested-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: do not special-case slab reclaim when watermarks are boosted · 28360f39
      Mel Gorman authored
      Dave Chinner reported a problem pointing a finger at commit 1c30844d
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction and compaction cannot
      move slab pages the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance with slab
      and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that slab
      shrinker is less surprising (arguably more sane but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system and properly obeys VM sysctls.
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtest configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark is using 50% of RAM which will barely be throttled
      and look like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances which I didn't replicate.  Finally, the number of
      iterations differ from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                           5.3.0-rc3   5.3.0-rc3
                           vanilla shrinker-v1r1
        Duration User         501.82      497.29
        Duration System      4401.44     4424.08
        Duration Elapsed     8124.76     8358.05
      
      This shows a slight skew for the max result, which represents a large
      outlier; the 1st, 2nd and 3rd quartiles are similar, indicating that
      the bulk of the results show little difference.  Note that an earlier
      version of the fsmark configuration showed a regression but that
      included more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration included time to delete all the test files when the test
      completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident which is part of what the patch is
      addressing.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, the number of dirty
         pages is generally kept at 10%, which matches
         vm.dirty_background_ratio and is the normal, expected historical
         behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45 with the bulk of the measured ratios
         roughly half way between those values. This is a different balance to
         what Dave reported but it was at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanilla kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      5. Overall vmstats are closer to normal expectations.
      
                                              5.3.0-rc3      5.3.0-rc3
                                                vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o Vanilla kernel is hitting direct reclaim more frequently,
      	  not very much in absolute terms but the fact the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab
      
      A more complete set of tests, which were part of the basis for
      introducing boosting, was also run and while there are some differences,
      they are well within tolerances.
      
      Bottom line: special-casing kswapd to avoid slab makes reclaim behaviour
      unpredictable and can lead to abnormal results for normal workloads.
      
      This patch restores the expected behaviour that slab and page cache are
      balanced consistently for a workload with a steady allocation ratio of
      slab/pagecache pages.  It also means that workloads that favour the
      preservation of slab over pagecache can be tuned via
      vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
      the parameter when boosting is active.
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mm, thp: restore node-local hugepage allocations" · a8282608
      Andrea Arcangeli authored
      This reverts commit 2f0799a0 ("mm, thp: restore node-local
      hugepage allocations").
      
      commit 2f0799a0 was rightfully applied to avoid the risk of a
      severe regression that was reported by the kernel test robot at the end
      of the merge window.  We now understand that the regression was a false
      positive and was caused by a significant increase in fairness during a
      swap thrashing benchmark.  So it's safe to re-apply the fix and continue
      improving the code from there.  The benchmark that reported the
      regression is very useful, but it provides a meaningful result only when
      there is no significant alteration in fairness during the workload.  The
      removal of __GFP_THISNODE increased fairness.
      
      __GFP_THISNODE cannot be used in the generic page faults path for new
      memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
      behavior significantly deviates from what the MPOL_DEFAULT semantics are
      supposed to be for THP and 4k allocations alike.
      
      Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
      set to "madvise") was never meant to provide an implicit MPOL_BIND on
      the "current" node the task is running on, causing swap storms and
      providing a much more aggressive behavior than even zone_reclaim_mode =
      3.
      
      Any workload that could have benefited from __GFP_THISNODE now has to
      enable zone_reclaim_mode = 1|2|3.  __GFP_THISNODE implicitly provided
      the zone_reclaim_mode behavior, but it only did so if THP was enabled:
      if THP was disabled, there would have been no chance to get any 4k page
      from the current node if the current node was full of pagecache, which
      further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
      MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
      semantics.  In fact the two are orthogonal: zone_reclaim_mode = 1|2|3
      must work exactly the same with MADV_HUGEPAGE set or not.
      
      The performance characteristic of memory depends on the hardware
      details.  The numbers below are obtained on Naples/EPYC architecture and
      the N/A projection extends them to show what we should aim for in the
      future as a good THP NUMA locality default.  The benchmark used
      exercises random memory seeks (note: the cost of the page faults is not
      part of the measurement).
      
        D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
        0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
      
      D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
      intra socket memory), D2 means distance two (i.e.  inter socket memory),
      etc...
      
      For the guest physical memory allocated by qemu and for guest mode
      kernel the performance characteristic of RAM is more complex and an
      ideal default could be:
      
        D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
        0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
      
      NOTE: the N/A are projections and haven't been measured yet, the
      measurement in this case is done on a 1950x with only two NUMA nodes.
      The THP case here means THP was used both in the host and in the guest.
      
      After applying this commit the THP NUMA locality order that we'll get
      out of MADV_HUGEPAGE is this:
      
        D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Before this commit it was:
      
        D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Even if we ignore the breakage of large workloads that can't fit in a
      single node that the __GFP_THISNODE implicit "current node" mbind
      caused, the THP NUMA locality order provided by __GFP_THISNODE was still
      not the one we shall aim for in the long term (i.e.  the first one at
      the top).
      
      After this commit is applied, we can introduce a new multi-order
      allocator API to replace the two alloc_pages_vma() calls in the page
      fault path with a single multi-order call:
      
              unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
              page = alloc_pages_multi_order(..., &order);
              if (!page)
              	goto out;
              if (!(order & (1 << 0))) {
              	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
              	/* THP fault */
              } else {
              	VM_WARN_ON(order != 1 << 0);
              	/* 4k fallback */
              }
      
      The page allocator logic has to be altered so that when it fails on any
      zone with order 9, it tries again with order 0 before falling
      back to the next zone in the zonelist.
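      
      As a rough illustration of that fallback order, here is a userspace toy
      model, not kernel code.  struct zone_model, alloc_multi_order_model() and
      the free counters are hypothetical names invented for this sketch, and
      THP_ORDER = 9 assumes 2M THP on top of 4k base pages.

```c
#include <stddef.h>

#define THP_ORDER 9	/* assumption: 2M THP on top of 4k base pages */

/* Hypothetical stand-in for a zone: counts of free blocks per order. */
struct zone_model {
	unsigned long free_thp;	/* free order-9 blocks */
	unsigned long free_4k;	/* free order-0 pages */
};

/*
 * Walk the zonelist (nearest zone first).  In each zone try order 9,
 * then order 0, before moving on to the next zone.  Returns the index
 * of the zone that satisfied the request and stores the order that was
 * used in *order, or returns -1 if every zone is exhausted.
 */
static int alloc_multi_order_model(struct zone_model *zonelist, size_t nr_zones,
				   unsigned int *order)
{
	for (size_t i = 0; i < nr_zones; i++) {
		if (zonelist[i].free_thp) {
			zonelist[i].free_thp--;
			*order = THP_ORDER;
			return (int)i;
		}
		if (zonelist[i].free_4k) {
			zonelist[i].free_4k--;
			*order = 0;
			return (int)i;
		}
	}
	return -1;
}
```

      With a zonelist where the nearest zone has only 4k pages free and the
      next zone still has THP free, the model hands out a 4k page from the
      nearest zone, i.e. "D0 4k" is preferred over "D1 THP".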
      
      After that we need to do more measurements and evaluate if adding an
      opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
      with "DN+1 THP | DN 4k" at every NUMA distance crossing.
      
      Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8282608
    • Andrea Arcangeli's avatar
      Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 92717d42
      Andrea Arcangeli authored
      Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
      
      The fixes for what was originally reported as "pathological THP
      behavior" were rightfully reverted to be sure not to introduce
      regressions at the end of a merge window after a severe regression report
      from the kernel bot.  We can safely re-apply them now that we have had
      time to analyze the problem.
      
      The mm process worked fine, because the good fixes were eventually
      committed upstream without excessive delay.
      
      The regression reported by the kernel bot however forced us to revert
      the good fixes to be sure not to introduce regressions and to give us
      the time to analyze the issue further.  The silver lining is that this
      extra time allowed us to think more about this issue and also plan for a
      future direction to improve things further in terms of THP NUMA
      locality.
      
      This patch (of 2):
      
      This reverts commit 356ff8a9 ("Revert "mm, thp: consolidate THP
      gfp handling into alloc_hugepage_direct_gfpmask"").  So it reapplies
      89c83fb5 ("mm, thp: consolidate THP gfp handling into
      alloc_hugepage_direct_gfpmask").
      
      Consolidation of the THP allocation flags in one place was meant to
      be a cleanup that makes otherwise scattered code, which imposes a
      maintenance burden, easier to handle.  There were no real problems
      observed with the gfp mask consolidation but the reversion was rushed
      through without a larger consensus regardless.
      
      This patch brings the consolidation back because it should make
      long term maintenance easier and future changes less error prone.
      
      [mhocko@kernel.org: changelog additions]
      Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92717d42
    • Roman Gushchin's avatar
      mm: workingset: fix vmstat counters for shadow nodes · ec9f0238
      Roman Gushchin authored
      Memcg counters for shadow nodes are broken because the memcg pointer is
      obtained in a wrong way. The following approach is used:
              virt_to_page(xa_node)->mem_cgroup
      
      Since commit 4d96ba35 ("mm: memcg/slab: stop setting
      page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
      set for slab pages, so memcg_from_slab_page() should be used instead.
      
      Also I doubt that it ever worked correctly: virt_to_head_page() should
      be used instead of virt_to_page().  Otherwise objects residing on tail
      pages are not accounted, because only the head page contains a valid
      mem_cgroup pointer.  That has been the case since the introduction of
      these counters by commit 68d48e6a ("mm: workingset: add vmstat counter
      for shadow nodes").
      
      Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec9f0238
    • Isaac J. Manjarres's avatar
      mm/usercopy: use memory range to be accessed for wraparound check · 95153169
      Isaac J. Manjarres authored
      Currently, when checking to see if accessing n bytes starting at address
      "ptr" will cause a wraparound in the memory addresses, the check in
      check_bogus_address() adds an extra byte, which is incorrect, as the
      range of addresses that will be accessed is [ptr, ptr + (n - 1)].
      
      This can lead to incorrectly detecting a wraparound in the memory
      address when trying to read 4 KB from memory that is mapped to the
      last possible page in the virtual address space, when in fact accessing
      that range of memory would not cause a wraparound to occur.
      
      Use the memory range that will actually be accessed when considering if
      accessing a certain amount of bytes will cause the memory address to
      wrap around.
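      
      The difference between the two checks can be sketched in plain C.  This
      is a userspace illustration, not the code in check_bogus_address();
      wraps_old() and wraps_fixed() are hypothetical names, and n is assumed
      to be non-zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Old behaviour: treats ptr + n as part of the accessed range, so a
 * copy that ends exactly at the top of the address space looks like a
 * wraparound.
 */
static bool wraps_old(uintptr_t ptr, size_t n)
{
	return ptr + (uintptr_t)n < ptr;
}

/*
 * Fixed behaviour: the last byte accessed is ptr + (n - 1), so only
 * that address is tested.  Assumes n > 0.
 */
static bool wraps_fixed(uintptr_t ptr, size_t n)
{
	return ptr + (uintptr_t)(n - 1) < ptr;
}
```

      For a 4 KB read that starts at the first byte of the last page
      (ptr = UINTPTR_MAX - 4095), wraps_old() reports a wraparound while
      wraps_fixed() does not; reading one byte more makes both report it.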
      
      Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Trilok Soni <tsoni@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95153169
    • Catalin Marinas's avatar
      mm: kmemleak: disable early logging in case of error · fcf3a5b6
      Catalin Marinas authored
      If an error occurs during kmemleak_init() (e.g.  kmem cache cannot be
      created), kmemleak is disabled but kmemleak_early_log remains enabled.
      Subsequently, when the .init.text section is freed, the log_early()
      function no longer exists.  To avoid a page fault in such a scenario,
      ensure that kmemleak_disable() also disables early logging.
      
      Link: http://lkml.kernel.org/r/20190731152302.42073-1-catalin.marinas@arm.com
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcf3a5b6
    • Kuppuswamy Sathyanarayanan's avatar
      mm/vmalloc.c: fix percpu free VM area search criteria · 5336e52c
      Kuppuswamy Sathyanarayanan authored
      Recent changes to the vmalloc code by commit 68ad4a33
      ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
      cause spurious percpu allocation failures.  These, in turn, can result
      in panic()s in the slub code.  One such possible panic was reported by
      Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939.
      Another related panic observed is,
      
       RIP: 0033:0x7f46f7441b9b
       Call Trace:
        dump_stack+0x61/0x80
        pcpu_alloc.cold.30+0x22/0x4f
        mem_cgroup_css_alloc+0x110/0x650
        cgroup_apply_control_enable+0x133/0x330
        cgroup_mkdir+0x41b/0x500
        kernfs_iop_mkdir+0x5a/0x90
        vfs_mkdir+0x102/0x1b0
        do_mkdirat+0x7d/0xf0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
      to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
      uses two lists (vmap_area_list & free_vmap_area_list) to track the used
      and free VM areas in VMALLOC space.  The pcpu_get_vm_areas(offsets[],
      sizes[], nr_vms, align) function is used for allocating congruent VM
      areas for the percpu memory allocator.  In order to not conflict with
      VMALLOC users, pcpu_get_vm_areas() allocates VM areas near the end of the
      VMALLOC space.  So the search for a free vm_area for the given requirement
      starts near VMALLOC_END and moves downwards towards VMALLOC_START.
      
      Prior to commit 68ad4a33, the search for a free vm_area in
      pcpu_get_vm_areas() involved the following two main steps.
      
      Step 1:
          Find an aligned "base" address near VMALLOC_END.
          va = free vm area near VMALLOC_END
      Step 2:
          Loop through number of requested vm_areas and check,
              Step 2.1:
                 if (base < VMALLOC_START)
                    1. fail with error
              Step 2.2:
                 // end is offsets[area] + sizes[area]
                 if (base + end > va->vm_end)
                     1. Move the base downwards and repeat Step 2
              Step 2.3:
                 if (base + start < va->vm_start)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      But Commit 68ad4a33 removed Step 2.2 and modified Step 2.3 as below:
      
              Step 2.3:
                 if (base + start < va->vm_start || base + end > va->vm_end)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      The above change is the root cause of spurious percpu memory allocation
      failures.  For example, consider a case where a relatively large vm_area
      (~ 30 TB) was ignored in the free vm_area search because it did not pass
      the base + end <= va->vm_end boundary check.  Ignoring such large free
      vm_areas leads to not finding a free vm_area within the boundary of
      VMALLOC_START to VMALLOC_END, which in turn leads to allocation failures.
      
      So modify the search algorithm to include Step 2.2.
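      
      The effect of keeping Step 2.2 separate can be seen in a simplified
      userspace model of the search.  This is only a sketch under several
      assumptions: free ranges are a sorted, non-overlapping array instead
      of the kernel's tree, offsets[] is sorted ascending so the last area
      ends highest, align is a power of two, and struct range, find_base()
      and the merged_check switch are hypothetical names, not kernel API.

```c
#include <stddef.h>

struct range { unsigned long start, end; };	/* free range [start, end) */

#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))	/* a must be a power of two */

/* Index of the highest free range whose start is <= addr, or -1. */
static int find_at_or_below(const struct range *fr, int n, unsigned long addr)
{
	int found = -1;

	for (int i = 0; i < n; i++)
		if (fr[i].start <= addr)
			found = i;
	return found;
}

/* Aligned top of range *va, skipping down past ranges too small to use. */
static unsigned long aligned_top(const struct range *fr, int *va,
				 unsigned long align)
{
	for (; *va >= 0; (*va)--) {
		unsigned long top = ALIGN_DOWN(fr[*va].end, align);

		if (fr[*va].start < top)
			return top;
	}
	return 0;
}

/* Returns 0 and stores the base in *out, or -1 if nothing fits. */
static int find_base(const struct range *fr, int nfree,
		     const unsigned long *off, const unsigned long *size,
		     int nr, unsigned long align, unsigned long lowest,
		     int merged_check, unsigned long *out)
{
	int area = nr - 1, term = nr - 1, va = nfree - 1;
	unsigned long start = off[area], end = start + size[area];
	unsigned long top = aligned_top(fr, &va, align);	/* Step 1 */
	unsigned long base;

	if (va < 0 || top < lowest + end)
		return -1;
	base = top - end;

	for (;;) {
		int too_high = base + end > fr[va].end;		/* Step 2.2 */
		int too_low = base + start < fr[va].start;	/* Step 2.3 */

		if (too_high || too_low) {
			/* 2.2 alone keeps the range; the merged check skips it. */
			if (too_low || merged_check)
				va--;
			top = aligned_top(fr, &va, align);
			if (va < 0 || top < lowest + end)	/* Step 2.1 */
				return -1;
			base = top - end;
			term = area;
			continue;
		}
		/* This area fits; check the previous one at the same base. */
		area = (area + nr - 1) % nr;
		if (area == term)
			break;
		start = off[area];
		end = start + size[area];
		va = find_at_or_below(fr, nfree, base + end);
		if (va < 0)
			return -1;
	}
	*out = base;
	return 0;
}
```

      With a small free range at the top and a large one below it, the
      model finds a base inside the large range when merged_check is 0,
      and fails when merged_check is 1 because the large range is skipped
      as soon as base + end exceeds its end.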
      
      Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5336e52c
    • Miles Chen's avatar
      mm/memcontrol.c: fix use after free in mem_cgroup_iter() · 54a83d6b
      Miles Chen authored
      This patch is sent to report a use after free in mem_cgroup_iter()
      that still occurs after merging commit be2657752e9e ("mm: memcg: fix
      use after free in mem_cgroup_iter()").
      
      I work with the android kernel trees (4.9 & 4.14), and commit be2657752e9e
      ("mm: memcg: fix use after free in mem_cgroup_iter()") has been merged
      into those trees.  However, I can still observe the use after free issue
      addressed in commit be2657752e9e (on low-end devices, a few times
      this month).
      
      backtrace:
              css_tryget <- crash here
              mem_cgroup_iter
              shrink_node
              shrink_zones
              do_try_to_free_pages
              try_to_free_pages
              __perform_reclaim
              __alloc_pages_direct_reclaim
              __alloc_pages_slowpath
              __alloc_pages_nodemask
      
      To debug, I poisoned mem_cgroup before freeing it:
      
        static void __mem_cgroup_free(struct mem_cgroup *memcg)
        {
              int node;
      
              for_each_node(node)
                      free_mem_cgroup_per_node_info(memcg, node);
              free_percpu(memcg->stat);
        +     /* poison memcg before freeing it */
        +     memset(memcg, 0x78, sizeof(struct mem_cgroup));
              kfree(memcg);
        }
      
      The coredump shows the position=0xdbbc2a00 is freed.
      
        (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
        $13 = {position = 0xdbbc2a00, generation = 0x2efd}
      
        0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
        0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
        0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
        0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
        0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
        0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
        0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
        0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
        0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
        0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
        0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
        0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878
      
      In the reclaim path, try_to_free_pages() does not set up
      sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
      shrink_node().
      
      In mem_cgroup_iter(), root is set to root_mem_cgroup because
      sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
      root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().
      
              try_to_free_pages
              	struct scan_control sc = {...}, target_mem_cgroup is 0x0;
              do_try_to_free_pages
              shrink_zones
              shrink_node
              	 mem_cgroup *root = sc->target_mem_cgroup;
              	 memcg = mem_cgroup_iter(root, NULL, &reclaim);
              mem_cgroup_iter()
              	if (!root)
              		root = root_mem_cgroup;
              	...
      
              	css = css_next_descendant_pre(css, &root->css);
              	memcg = mem_cgroup_from_css(css);
              	cmpxchg(&iter->position, pos, memcg);
      
      My device uses memcg non-hierarchical mode.  When we release a memcg:
      invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
      If non-hierarchical mode is used, invalidate_reclaim_iterators() never
      reaches root_mem_cgroup.
      
        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
              struct mem_cgroup *memcg = dead_memcg;
      
              for (; memcg; memcg = parent_mem_cgroup(memcg))
              ...
        }
      
      So the use after free scenario looks like:
      
        CPU1						CPU2
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            css = css_next_descendant_pre(css, &root->css);
            memcg = mem_cgroup_from_css(css);
            cmpxchg(&iter->position, pos, memcg);
      
              				invalidate_reclaim_iterators(memcg);
              				...
              				__mem_cgroup_free()
              					kfree(memcg);
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
            iter = &mz->iter[reclaim->priority];
            pos = READ_ONCE(iter->position);
            css_tryget(&pos->css) <- use after free
      
      To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
      in invalidate_reclaim_iterators().
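      
      The idea can be sketched with a small userspace model.  This is not
      the kernel implementation: struct memcg here is a hypothetical
      stand-in with a single cached iterator slot, root_memcg plays the
      role of root_mem_cgroup, and a plain store replaces the cmpxchg()
      used on the real iterators.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct memcg {
	struct memcg *parent;	/* NULL at the top; in non-hierarchical
				 * mode that is not the root */
	struct memcg *iter_pos;	/* cached reclaim iterator position */
};

static struct memcg root_memcg;

/* Drop the cached position in 'from' if it points at the dying memcg. */
static void invalidate_one(struct memcg *from, struct memcg *dead)
{
	if (from->iter_pos == dead)
		from->iter_pos = NULL;
}

static void invalidate_iterators(struct memcg *dead)
{
	struct memcg *memcg = dead, *last;

	/* Walk the dying memcg and all of its ancestors. */
	do {
		invalidate_one(memcg, dead);
		last = memcg;
	} while ((memcg = memcg->parent));

	/*
	 * In non-hierarchical mode the walk above stops before the
	 * root, so the root's iterator has to be handled separately.
	 */
	if (last != &root_memcg)
		invalidate_one(&root_memcg, dead);
}
```

      In the non-hierarchical case the dying memcg has no parent, so
      without the final step the root would keep pointing at freed memory.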
      
      [cai@lca.pw: fix -Wparentheses compilation warning]
        Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54a83d6b
    • Henry Burns's avatar
      mm/z3fold.c: fix z3fold_destroy_pool() race condition · b997052b
      Henry Burns authored
      The constraint from the zpool use of z3fold_destroy_pool() is there are
      no outstanding handles to memory (so no active allocations), but it is
      possible for there to be outstanding work on either of the two wqs in
      the pool.
      
      Calling z3fold_deregister_migration() before the workqueues are drained
      means that there can be allocated pages referencing a freed inode,
      causing any thread in compaction to be able to trip over the bad pointer
      in PageMovable().
      
      Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com
      Fixes: 1f862989 ("mm/z3fold.c: support page migration")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jonathan Adams <jwadams@google.com>
      Cc: Vitaly Vul <vitaly.vul@sony.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b997052b
    • Henry Burns's avatar
      mm/z3fold.c: fix z3fold_destroy_pool() ordering · 6051d3bd
      Henry Burns authored
      The constraint from the zpool use of z3fold_destroy_pool() is there are
      no outstanding handles to memory (so no active allocations), but it is
      possible for there to be outstanding work on either of the two wqs in
      the pool.
      
      If there is work queued on pool->compact_wq when it is called,
      z3fold_destroy_pool() will do:
      
         z3fold_destroy_pool()
           destroy_workqueue(pool->release_wq)
           destroy_workqueue(pool->compact_wq)
             drain_workqueue(pool->compact_wq)
               do_compact_page(zhdr)
                 kref_put(&zhdr->refcount)
                   __release_z3fold_page(zhdr, ...)
                     queue_work_on(pool->release_wq, &pool->work) *BOOM*
      
      So compact_wq needs to be destroyed before release_wq.
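      
      The ordering problem can be reproduced in a toy userspace model.
      This is only a sketch: struct wq_model, struct pool_model and the
      *_model() helpers are hypothetical and merely mimic the "drain, then
      destroy" behaviour described above, assuming every compaction work
      item ends up queueing release work.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a workqueue. */
struct wq_model {
	bool alive;
	int pending;
};

struct pool_model {
	struct wq_model compact_wq;
	struct wq_model release_wq;
	bool queued_on_dead_wq;	/* set when the *BOOM* case is hit */
};

static void queue_work_model(struct pool_model *pool, struct wq_model *wq)
{
	if (!wq->alive) {
		pool->queued_on_dead_wq = true;
		return;
	}
	wq->pending++;
}

/* Drain all pending work, then mark the queue as destroyed. */
static void destroy_wq_model(struct pool_model *pool, struct wq_model *wq)
{
	while (wq->pending > 0) {
		wq->pending--;
		/* Assumption: each compaction item drops the last ref. */
		if (wq == &pool->compact_wq)
			queue_work_model(pool, &pool->release_wq);
	}
	wq->alive = false;
}

static void destroy_pool_model(struct pool_model *pool, bool compact_first)
{
	if (compact_first) {
		destroy_wq_model(pool, &pool->compact_wq);
		destroy_wq_model(pool, &pool->release_wq);
	} else {
		destroy_wq_model(pool, &pool->release_wq);
		destroy_wq_model(pool, &pool->compact_wq);
	}
}
```

      With one compaction item pending, destroying release_wq first sets
      queued_on_dead_wq, while destroying compact_wq first lets the
      release work run before its queue goes away.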
      
      Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com
      Fixes: 5d03a661 ("mm/z3fold.c: use kref to prevent page free/compact race")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jonathan Adams <jwadams@google.com>
      Cc: Vitaly Vul <vitaly.vul@sony.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6051d3bd
    • Yang Shi's avatar
      mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind · a53190a4
      Yang Shi authored
      When running syzkaller internally, we ran into the below bug on 4.9.x
      kernel:
      
        kernel BUG at mm/huge_memory.c:2124!
        invalid opcode: 0000 [#1] SMP KASAN
        CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
        task: ffff880067b34900 task.stack: ffff880068998000
        RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
        Call Trace:
          split_huge_page include/linux/huge_mm.h:100 [inline]
          queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
          walk_pmd_range mm/pagewalk.c:50 [inline]
          walk_pud_range mm/pagewalk.c:90 [inline]
          walk_pgd_range mm/pagewalk.c:116 [inline]
          __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
          walk_page_range+0x154/0x370 mm/pagewalk.c:285
          queue_pages_range+0x115/0x150 mm/mempolicy.c:694
          do_mbind mm/mempolicy.c:1241 [inline]
          SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
          SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
          do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
          entry_SYSCALL_64_after_swapgs+0x5d/0xdb
        Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
        RIP  [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
         RSP <ffff88006899f980>
      
      with the below test:
      
        uint64_t r[1] = {0xffffffffffffffff};
      
        int main(void)
        {
              syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
              intptr_t res = 0;
              res = syscall(__NR_socket, 0x11, 3, 0x300);
              if (res != -1)
                      r[0] = res;
              *(uint32_t*)0x20000040 = 0x10000;
              *(uint32_t*)0x20000044 = 1;
              *(uint32_t*)0x20000048 = 0xc520;
              *(uint32_t*)0x2000004c = 1;
              syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
              syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
              *(uint64_t*)0x20000340 = 2;
              syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
              return 0;
        }
      
      Actually the test does:
      
        mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
        socket(AF_PACKET, SOCK_RAW, 768)        = 3
        setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
        mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
        mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
      
      The setsockopt() would allocate compound pages (16 pages in this test)
      for packet tx ring, then the mmap() would call packet_mmap() to map the
      pages into the user address space specified by the mmap() call.
      
      When calling mbind(), it would scan the vma to queue the pages for
      migration to the new node.  It would split any huge page since 4.9
      doesn't support THP migration; however, the packet tx ring compound
      pages are not THP and not even movable.  So, the above bug is triggered.
      
      However, later kernels are not hit by this issue due to commit
      d44d363f ("mm: don't assume anonymous pages have SwapBacked flag"),
      which just removes the PageSwapBacked check for a different reason.
      
      But, there is a deeper issue.  According to the semantics of mbind(), it
      should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
      MPOL_MF_STRICT was also specified, but the kernel was unable to move all
      existing pages in the range.  The tx ring of the packet socket is
      definitely not movable, however, mbind() returns success for this case.
      
      Although most socket files are associated with non-movable pages, XDP
      may have movable pages from gup.  So it is not sufficient to just check
      the underlying file type of the vma in vma_migratable().
      
      Change migrate_page_add() to check if the page is movable or not; if it
      is unmovable, just return -EIO.  But do not abort the pte walk immediately,
      since there may be pages off LRU temporarily.  We should migrate other
      pages if MPOL_MF_MOVE* is specified.  Set the has_unmovable flag if some
      pages could not be moved, then return -EIO for mbind() eventually.
      
      With this change the above test would return -EIO as expected.
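      
      The "remember the failure but keep walking" pattern can be sketched
      as follows.  This is a userspace toy, not the mempolicy code:
      struct page_model and queue_pages_model() are hypothetical names,
      and the model ignores details such as whether a page is actually
      misplaced.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page seen during the pte walk. */
struct page_model {
	bool movable;
	bool queued;	/* set once the page is queued for migration */
};

/*
 * Queue every movable page for migration.  An unmovable page does not
 * abort the walk; it is only recorded, and -EIO is returned at the end
 * so that the remaining pages still get a chance to migrate.
 */
static int queue_pages_model(struct page_model *pages, size_t nr)
{
	bool has_unmovable = false;

	for (size_t i = 0; i < nr; i++) {
		if (!pages[i].movable) {
			has_unmovable = true;
			continue;
		}
		pages[i].queued = true;
	}
	return has_unmovable ? -EIO : 0;
}
```

      A range with one unmovable page in the middle therefore still has
      its movable pages queued, and the caller sees -EIO.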
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a53190a4
    • Yang Shi's avatar
      mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified · d8835445
      Yang Shi authored
      When both MPOL_MF_MOVE* and MPOL_MF_STRICT are specified, mbind() should
      try its best to migrate misplaced pages; if some of the pages could not
      be migrated, it should return -EIO.
      
      There are three different sub-cases:
       1. vma is not migratable
       2. vma is migratable, but there are unmovable pages
       3. vma is migratable, pages are movable, but migrate_pages() fails
      
      If #1 happens, the kernel just aborts immediately and returns -EIO,
      since a7f40cfe ("mm: mempolicy: make mbind() return -EIO when
      MPOL_MF_STRICT is specified").
      
      If #3 happens, the kernel sets the policy and migrates pages on a
      best-effort basis, but won't roll back the migrated pages or reset the
      policy.
      
      Before that commit, they behaved in the same way.  It would be better
      to keep their behavior consistent.  But rolling back the migrated pages
      and resetting the policy does not sound feasible, so just make #1
      behave the same as #3.
      
      Userspace will know that not everything was successfully migrated (via
      -EIO), and can take whatever steps it deems necessary - attempt
      rollback, determine which exact page(s) are violating the policy, etc.
      
      Make queue_pages_range() return 1 to indicate there are unmovable pages
      or vma is not migratable.
      
      #2 is not handled correctly in the current kernel; the following
      patch will fix it.
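
      As a rough illustration of the intended end state, here is a userspace
      model of the three sub-cases when both MPOL_MF_MOVE* and MPOL_MF_STRICT
      are set.  Everything in it is hypothetical (enum mbind_case, struct
      mbind_outcome and the model_* functions are not kernel symbols); it
      only encodes what the text states: the queueing step reports 1 for a
      non-migratable vma or unmovable pages, the policy is still set, and
      -EIO is returned.  Sub-case 2 is included on the assumption that the
      follow-up patch is applied as well.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical labels for the situations discussed above. */
enum mbind_case {
	CASE_ALL_MOVED,			/* nothing goes wrong */
	CASE_VMA_NOT_MIGRATABLE,	/* sub-case 1 */
	CASE_UNMOVABLE_PAGES,		/* sub-case 2 */
	CASE_MIGRATE_PAGES_FAILED,	/* sub-case 3 */
};

struct mbind_outcome {
	bool policy_set;	/* was the new policy applied? */
	int ret;		/* value handed back to userspace */
};

/* Model of the queueing step: 1 means something could not be queued. */
static int model_queue_pages_range(enum mbind_case c)
{
	return (c == CASE_VMA_NOT_MIGRATABLE ||
		c == CASE_UNMOVABLE_PAGES) ? 1 : 0;
}

/* Model of mbind() with MPOL_MF_MOVE* | MPOL_MF_STRICT. */
static struct mbind_outcome model_mbind_move_strict(enum mbind_case c)
{
	struct mbind_outcome out = { .policy_set = false, .ret = 0 };
	int unqueued = model_queue_pages_range(c);
	int nr_failed = (c == CASE_MIGRATE_PAGES_FAILED) ? 1 : 0;

	assert(unqueued == 0 || unqueued == 1);

	/* No early abort: the policy is applied in every sub-case. */
	out.policy_set = true;

	/* Any page left behind is reported once, at the end. */
	if (unqueued == 1 || nr_failed > 0)
		out.ret = -EIO;

	return out;
}
```

      All three sub-cases end the same way in this model: policy_set is true
      and ret is -EIO, which is the consistency this change is after.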
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8835445
    • Ralph Campbell's avatar
      mm/hmm: fix bad subpage pointer in try_to_unmap_one · 1de13ee5
      Ralph Campbell authored
      When migrating an anonymous private page to a ZONE_DEVICE private page,
      the source page->mapping and page->index fields are copied to the
      destination ZONE_DEVICE struct page and the page_mapcount() is
      increased.  This is so rmap_walk() can be used to unmap and migrate the
      page back to system memory.
      
      However, try_to_unmap_one() computes the subpage pointer from a swap
      pte, which yields an invalid page pointer and results in a kernel panic
      such as:
      
        BUG: unable to handle page fault for address: ffffea1fffffffc8
      
      Currently, only single pages can be migrated to device private memory so
      no subpage computation is needed and it can be set to "page".
      
      [rcampbell@nvidia.com: add comment]
        Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1de13ee5
    • Ralph Campbell's avatar
      mm/hmm: fix ZONE_DEVICE anon page mapping reuse · 7ab0ad0e
      Ralph Campbell authored
      When a ZONE_DEVICE private page is freed, the page->mapping field can be
      set.  If this page is reused as an anonymous page, the previous value
      can prevent the page from being inserted into the CPU's anon rmap table.
      For example, when migrating a pte_none() page to device memory:
      
        migrate_vma(ops, vma, start, end, src, dst, private)
          migrate_vma_collect()
            src[] = MIGRATE_PFN_MIGRATE
          migrate_vma_prepare()
            /* no page to lock or isolate so OK */
          migrate_vma_unmap()
            /* no page to unmap so OK */
          ops->alloc_and_copy()
            /* driver allocates ZONE_DEVICE page for dst[] */
          migrate_vma_pages()
            migrate_vma_insert_page()
              page_add_new_anon_rmap()
                __page_set_anon_rmap()
                  /* This check sees the page's stale mapping field */
                  if (PageAnon(page))
                    return
                  /* page->mapping is not updated */
      
      The result is that the migration appears to succeed but a subsequent CPU
      fault will be unable to migrate the page back to system memory or worse.
      
      Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
      pointer data doesn't affect future page use.
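
      The effect of the stale field can be reproduced in a small userspace
      model.  All names here are hypothetical: struct dev_page,
      MODEL_MAPPING_ANON and the model_* helpers only imitate the early
      return shown in the call chain above and are not the kernel's
      definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit marking a mapping value as anonymous. */
#define MODEL_MAPPING_ANON 0x1UL

/* Hypothetical stand-in for a ZONE_DEVICE struct page. */
struct dev_page {
	unsigned long mapping;
};

static bool model_page_anon(const struct dev_page *page)
{
	return (page->mapping & MODEL_MAPPING_ANON) != 0;
}

/*
 * Imitates the early return in the call chain: a page that already looks
 * anonymous keeps whatever mapping it had.
 */
static void model_set_anon_rmap(struct dev_page *page, unsigned long anon_vma)
{
	assert((anon_vma & MODEL_MAPPING_ANON) == 0);

	if (model_page_anon(page))
		return;		/* page->mapping is not updated */
	page->mapping = anon_vma | MODEL_MAPPING_ANON;
}

/* Free path; clear_mapping selects the behavior with or without the fix. */
static void model_free_device_page(struct dev_page *page, bool clear_mapping)
{
	if (clear_mapping)
		page->mapping = 0;
}
```

      Without the clear, a page freed while mapped for one anon_vma and then
      reused for another keeps pointing at the first; with the clear, the
      second call to model_set_anon_rmap() installs the new value.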
      
      Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
      Fixes: b7a52310 ("mm: don't clear ->mapping in hmm_devmem_free")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ab0ad0e
  5. 09 Aug, 2019 1 commit
  6. 03 Aug, 2019 7 commits
  7. 31 Jul, 2019 1 commit
    • Laura Abbott's avatar
      mm: slub: Fix slab walking for init_on_free · 1b7e816f
      Laura Abbott authored
      To properly clear the slab on free with slab_want_init_on_free, we walk
      the list of free objects using get_freepointer/set_freepointer.
      
      The value we get from get_freepointer may not be valid.  This isn't an
      issue since an actual value will get written later but this means
      there's a chance of triggering a bug if we use this value with
      set_freepointer:
      
        kernel BUG at mm/slub.c:306!
        invalid opcode: 0000 [#1] PREEMPT PTI
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4
        RIP: 0010:kfree+0x58a/0x5c0
        Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 <0f> 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4
        RSP: 0000:ffffffff82603d90 EFLAGS: 00010002
        RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000
        RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320
        RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100
        R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300
        FS:  0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0
        Call Trace:
         apply_wqattrs_prepare+0x154/0x280
         apply_workqueue_attrs_locked+0x4e/0xe0
         apply_workqueue_attrs+0x36/0x60
         alloc_workqueue+0x25a/0x6d0
         workqueue_init_early+0x246/0x348
         start_kernel+0x3c7/0x7ec
         x86_64_start_reservations+0x40/0x49
         x86_64_start_kernel+0xda/0xe4
         secondary_startup_64+0xb6/0xc0
        Modules linked in:
        ---[ end trace f67eb9af4d8d492b ]---
      
      Fix this by ensuring the value we set with set_freepointer is either NULL
      or another value in the chain.
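
      A simplified userspace sketch of that idea follows.  It is not the
      mm/slub.c implementation: struct model_obj and the model_* functions
      are hypothetical, and the sanity check in model_set_freepointer() is
      an assumption made for illustration.  The point it shows is that the
      possibly invalid pointer read from the last object is never written
      back; each object is relinked to either NULL or an object already
      processed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical slab object with its free pointer stored inline. */
struct model_obj {
	struct model_obj *next_free;
	unsigned char payload[16];
};

/* Hypothetical setter with a simple self-link sanity check. */
static void model_set_freepointer(struct model_obj *obj, struct model_obj *fp)
{
	assert(obj != fp);
	obj->next_free = fp;
}

/*
 * Zero every object on the list head..tail and rebuild the chain.
 * Returns the new list head (the old tail).
 */
static struct model_obj *model_init_on_free(struct model_obj *head,
					    struct model_obj *tail)
{
	struct model_obj *new_head = NULL;
	struct model_obj *next = head;
	struct model_obj *obj;

	do {
		obj = next;
		/* Read the link before zeroing; for tail it may be garbage. */
		next = obj->next_free;
		memset(obj, 0, sizeof(*obj));
		/* Link only to NULL or to an object already in the chain. */
		model_set_freepointer(obj, new_head);
		new_head = obj;
	} while (obj != tail);

	return new_head;
}
```

      Given a list a -> b -> c where c's free pointer holds junk, the sketch
      returns c as the new head with c -> b -> a -> NULL and all payloads
      zeroed; the junk value is read but never dereferenced or stored.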

      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Laura Abbott <labbott@redhat.com>
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b7e816f
  8. 25 Jul, 2019 1 commit