  Nov 22, 2021
    • hugetlbfs: flush before unlock on move_hugetlb_page_tables() · 13e4ad2c
      Nadav Amit authored
      
We must flush the TLB before releasing i_mmap_rwsem to avoid the
      potential reuse of an unshared PMDs page.  move_hugetlb_page_tables()
      does not follow this ordering, so the last reference on the page-table
      page can be dropped before the TLB flush has taken place.
      
      Prevent it by reordering the operations and flushing the TLB before
      releasing i_mmap_rwsem.
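
      For illustration, the reordering amounts to the following (a sketch,
      not the verbatim upstream diff; the surrounding
      move_hugetlb_page_tables() code is elided):

          /* Before: unlock, then flush -- the unshared PMD page could
           * be freed and reused while stale TLB entries still
           * referenced it. */
          i_mmap_unlock_write(mapping);
          flush_tlb_range(old_vma, old_addr, old_end);

          /* After: flush while i_mmap_rwsem is still held. */
          flush_tlb_range(old_vma, old_addr, old_end);
          i_mmap_unlock_write(mapping);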
      
      Fixes: 550a7d60 ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Nadav Amit <namit@vmware.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: flush TLBs correctly after huge_pmd_unshare · a4a118f2
      Nadav Amit authored
      
When a call from __unmap_hugepage_range() to huge_pmd_unshare()
      succeeds, a TLB flush is missing.  This TLB flush must be performed
      before releasing the i_mmap_rwsem, in order to prevent an unshared
      PMDs page from being released and reused before the TLB flush takes
      place.
      
Arguably, a comprehensive solution would use the mmu_gather interface
      to batch the TLB flushes and the PMDs page release; however, that is
      not an easy solution: (1) try_to_unmap_one() and try_to_migrate_one()
      also call huge_pmd_unshare() and cannot use the mmu_gather interface;
      and (2) deferring the release of the page reference for the PMDs page
      until after i_mmap_rwsem is dropped can confuse huge_pmd_unshare()
      into thinking PMDs are shared when they are not.
      
      Fix __unmap_hugepage_range() by adding the missing TLB flush, and
      forcing a flush when unshare is successful.
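
      As an illustrative sketch of the fix (simplified; the variable names
      and control flow are assumed from context, not the verbatim diff):

          if (huge_pmd_unshare(mm, vma, &address, ptep)) {
                  spin_unlock(ptl);
                  /* Sketch: the unshared PMD page mapped a whole PUD
                   * range, so make the gather cover it and force a TLB
                   * flush before i_mmap_rwsem can be released. */
                  tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
                  force_flush = true;
                  continue;
          }

      with tlb_flush_mmu_tlbonly(tlb) issued at the end of the unmap loop
      whenever force_flush was set.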
      
Fixes: 24669e58 ("hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages") # 3.6
Signed-off-by: Nadav Amit <namit@vmware.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Nov 09, 2021
    • kernel/resource: disallow access to exclusive system RAM regions · a9e7b8d4
      David Hildenbrand authored
      virtio-mem dynamically exposes memory inside a device memory region as
      system RAM to Linux, coordinating with the hypervisor which parts are
      actually "plugged" and consequently usable/accessible.
      
On the one hand, the virtio-mem driver adds/removes whole memory
      blocks, creating/removing busy IORESOURCE_SYSTEM_RAM resources; on the
      other hand, it logically (un)plugs memory inside added memory blocks,
      dynamically either exposing them to the buddy or hiding them from the
      buddy and marking them PG_offline.
      
In contrast to a physical device like a DIMM, any use of the
      device-provided memory requires the virtio-mem driver, because it
      performs the handshake with the hypervisor.  virtio-mem memory cannot
      simply be accessed via /dev/mem without a driver.
      
      There is no safe way to:
a) Access plugged memory blocks via /dev/mem, as they might contain
         unplugged holes or might get silently unplugged by the virtio-mem
         driver and consequently become inaccessible.
      b) Access unplugged memory blocks via /dev/mem because the virtio-mem
         driver is required to make them actually accessible first.
      
The virtio-spec states that unplugged memory blocks MUST NOT be
      written, and only selected unplugged memory blocks MAY be read.  We
      want to make sure this is the case in sane environments -- wherever
      the virtio-mem driver has been loaded.
      
We want to make sure that in a sane environment, nobody "accidentally"
      accesses unplugged memory inside the device-managed region.  For
      example, a user might spot a memory region in /proc/iomem and try
      accessing it via /dev/mem with gdb, or dump it via something else.  By
      the time the mmap() happens, the memory might already have been
      removed by the virtio-mem driver silently: the mmap() would succeed
      and user space might accidentally access unplugged memory.
      
So once the driver has been loaded and has detected the device along
      with the device-managed region, we just want to disallow any access to
      that region via /dev/mem.
      
      In an ideal world, we would mark the whole region as busy ("owned by a
      driver") and exclude it; however, that would be wrong, as we don't really
      have actual system RAM at these ranges added to Linux ("busy system RAM").
      Instead, we want to mark such ranges as "not actual busy system RAM but
      still soft-reserved and prepared by a driver for future use."
      
      Let's teach iomem_is_exclusive() to reject access to any range with
      "IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even
      if "iomem=relaxed" is set.  Introduce EXCLUSIVE_SYSTEM_RAM to make it
      easier for applicable drivers to depend on this setting in their Kconfig.
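
      A sketch of the resulting check (the helper name here is illustrative;
      the real logic lives inside iomem_is_exclusive() and walks the
      resource tree):

          static bool resource_is_exclusive(struct resource *res)
          {
                  /* Sketch, not the upstream diff.  Driver-managed,
                   * exclusive system RAM is always off-limits to
                   * /dev/mem, even if not busy and even with
                   * "iomem=relaxed". */
                  if ((res->flags & IORESOURCE_SYSTEM_RAM) == IORESOURCE_SYSTEM_RAM &&
                      (res->flags & IORESOURCE_EXCLUSIVE))
                          return true;

                  /* Otherwise only busy, exclusive resources are blocked,
                   * and "iomem=relaxed" (!strict_iomem_checks) relaxes
                   * even that. */
                  return strict_iomem_checks &&
                         (res->flags & IORESOURCE_BUSY) &&
                         (res->flags & IORESOURCE_EXCLUSIVE);
          }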
      
      For now, there are no applicable ranges and we'll modify virtio-mem next
      to properly set IORESOURCE_EXCLUSIVE on the parent resource container it
      creates to contain all actual busy system RAM added via
      add_memory_driver_managed().
      
Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kasan: use is_kernel() helper · 3298cbe8
      Kefeng Wang authored
Directly use the is_kernel() helper in kernel_or_module_addr().
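
      The resulting function is essentially (a sketch of the post-patch
      state in mm/kasan/report.c, assuming the pre-existing
      is_module_address() check is kept):

          static bool kernel_or_module_addr(const void *addr)
          {
                  /* Sketch: is_kernel() replaces the open-coded
                   * _stext/_end range check. */
                  if (is_kernel((unsigned long)addr))
                          return true;
                  if (is_module_address((unsigned long)addr))
                          return true;
                  return false;
          }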
      
Link: https://lkml.kernel.org/r/20210930071143.63410-8-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib, stackdepot: add helper to print stack entries into buffer · 0f68d45e
      Imran Khan authored
To print stack entries into a buffer, users of stackdepot first get a
      list of stack entries using stack_depot_fetch() and then print this
      list into a buffer using stack_trace_snprint().  Provide a helper in
      stackdepot for this purpose, and change the above-mentioned users to
      use this helper.
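
      A plausible shape for the helper (the name follows the changelog; the
      body is a sketch composed from the two existing calls it replaces):

          int stack_depot_snprint(depot_stack_handle_t handle, char *buf,
                                  size_t size, int spaces)
          {
                  unsigned long *entries;
                  unsigned int nr_entries;

                  /* Sketch: fetch the entries, then format them into
                   * the caller's buffer in one step. */
                  nr_entries = stack_depot_fetch(handle, &entries);
                  return nr_entries ? stack_trace_snprint(buf, size, entries,
                                                          nr_entries, spaces) : 0;
          }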
      
      [imran.f.khan@oracle.com: fix build error]
        Link: https://lkml.kernel.org/r/20210915175321.3472770-4-imran.f.khan@oracle.com
      [imran.f.khan@oracle.com: export stack_depot_snprint() to modules]
        Link: https://lkml.kernel.org/r/20210916133535.3592491-4-imran.f.khan@oracle.com
      
Link: https://lkml.kernel.org/r/20210915014806.3206938-4-imran.f.khan@oracle.com
      Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Jani Nikula <jani.nikula@intel.com>	[i915]
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib, stackdepot: add helper to print stack entries · 505be481
      Imran Khan authored
To print stack entries, users of stackdepot first use
      stack_depot_fetch() to get a list of stack entries and then use
      stack_trace_print() to print this list.  Provide a helper in
      stackdepot that prints stack entries based on a stackdepot handle, and
      change the above-mentioned users to use this helper.
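
      Sketched the same way from the two calls it replaces (name per the
      changelog, body assumed):

          void stack_depot_print(depot_stack_handle_t stack)
          {
                  unsigned long *entries;
                  unsigned int nr_entries;

                  /* Sketch: fetch and print in one step. */
                  nr_entries = stack_depot_fetch(stack, &entries);
                  if (nr_entries > 0)
                          stack_trace_print(entries, nr_entries, 0);
          }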
      
Link: https://lkml.kernel.org/r/20210915014806.3206938-3-imran.f.khan@oracle.com
      Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hugetlb: remove mlock ulimit for SHM_HUGETLB · 83c1fd76
      zhangyiru authored
Commit 21a3c273 ("mm, hugetlb: add thread name and pid to
      SHM_HUGETLB mlock rlimit warning") marked this limit as deprecated in
      2012, but it has not been deleted yet.
      
      Mike says he still sees that message in log files on occasion, so maybe we
      should preserve this warning.
      
Also remove the hugetlbfs-related user_shm_unlock() call in ipc/shm.c,
      as well as the user_shm_unlock() call after the 'out' label.
      
Link: https://lkml.kernel.org/r/20211103105857.25041-1-zhangyiru3@huawei.com
      Signed-off-by: zhangyiru <zhangyiru3@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Liu Zixian <liuzixian4@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: wuxu.wu <wuxu.wu@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe
      Johannes Weiner authored
Historically (pre-2.5), the inode shrinker used to reclaim only empty
      inodes and skip over those that still contained page cache.  This
      caused problems on highmem hosts: struct inodes could fill up the
      lowmem zones before the page cache in the highmem zones had been
      reclaimed.
      
      To address this, the inode shrinker started to strip page cache to
      facilitate reclaiming lowmem.  However, this comes with its own set of
      problems: the shrinkers may drop actively used page cache just because
      the inodes are not currently open or dirty - think working with a large
      git tree.  It further doesn't respect cgroup memory protection settings
      and can cause priority inversions between containers.
      
      Nowadays, the page cache also holds non-resident info for evicted cache
      pages in order to detect refaults.  We've come to rely heavily on this
      data inside reclaim for protecting the cache workingset and driving swap
      behavior.  We also use it to quantify and report workload health through
      psi.  The latter in turn is used for fleet health monitoring, as well as
      driving automated memory sizing of workloads and containers, proactive
      reclaim and memory offloading schemes.
      
The consequence of dropping page cache prematurely is that we're
      seeing subtle and not-so-subtle failures in all of the above-mentioned
      scenarios: workloads enter unexpected thrashing states while we lose
      the ability to reliably detect it.
      
To fix this, on non-highmem systems at least, simply going back to
      rotating populated inodes on the LRU isn't feasible.  We've tried that
      (commit a76cf1a4 ("mm: don't reclaim inodes with many attached
      pages")) and had to back it out (commit 69056ee6 ("Revert "mm:
      don't reclaim inodes with many attached pages"")).
      
      The issue is mostly that shrinker pools attract pressure based on their
      size, and when objects get skipped the shrinkers remember this as
      deferred reclaim work.  This accumulates excessive pressure on the
      remaining inodes, and we can quickly eat into heavily used ones, or
      dirty ones that require IO to reclaim, when there potentially is plenty
      of cold, clean cache around still.
      
      Instead, this patch keeps populated inodes off the inode LRU in the
      first place - just like an open file or dirty state would.  An otherwise
      clean and unused inode then gets queued when the last cache entry
      disappears.  This solves the problem without reintroducing the reclaim
      issues, and generally is a bit more scalable than having to wade through
      potentially hundreds of thousands of busy inodes.
      
      Locking is a bit tricky because the locks protecting the inode state
      (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
      irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
      serialized through i_lock, taken before the i_pages lock, to make sure
      depopulated inodes are queued reliably.  Additions may race with
      deletions, but we'll check again in the shrinker.  If additions race
      with the shrinker itself, we're protected by the i_lock: if find_inode()
      or iput() win, the shrinker will bail on the elevated i_count or
      I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
      will set I_FREEING and inhibit further igets(), which will cause the
      other side to create a new instance of the inode instead.
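
      As an illustration of the deletion-side ordering (a simplified
      sketch; the gating helper and surrounding code are assumptions, not
      the full upstream diff):

          void delete_from_page_cache(struct page *page)
          {
                  struct address_space *mapping = page_mapping(page);
                  struct inode *inode = mapping->host;

                  /* Sketch: i_lock nests outside the irq-safe i_pages
                   * lock, so take it first. */
                  spin_lock(&inode->i_lock);
                  xa_lock_irq(&mapping->i_pages);
                  __delete_from_page_cache(page, NULL);
                  xa_unlock_irq(&mapping->i_pages);

                  /* Queue an otherwise clean, unused inode on the LRU
                   * once its last cache entry disappears. */
                  if (mapping_empty(mapping))
                          inode_add_lru(inode);
                  spin_unlock(&inode->i_lock);

                  put_page(page);
          }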
      
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Nov 08, 2021
    • arm64: Track no early_pgtable_alloc() for kmemleak · c6975d7c
      Qian Cai authored
      
After switching the page size from 64KB to 4KB on several arm64
      servers here, kmemleak starts to run out of its early memory pool due
      to the huge number of early_pgtable_alloc() calls:
      
        kmemleak_alloc_phys()
        memblock_alloc_range_nid()
        memblock_phys_alloc_range()
        early_pgtable_alloc()
        init_pmd()
        alloc_init_pud()
        __create_pgd_mapping()
        __map_memblock()
        paging_init()
        setup_arch()
        start_kernel()
      
Increasing the default value of DEBUG_KMEMLEAK_MEM_POOL_SIZE by 4
      times still won't be enough for a server with 200GB+ of memory.  There
      is little interest in checking those early page tables for memory
      leaks, and those early memory mappings should not reference other
      memory, so there are no kmemleak false positives to lose.  Hence, we
      can safely skip tracking those early allocations from kmemleak, as we
      did in commit fed84c78 ("mm/memblock.c: skip kmemleak for
      kasan_init()"), without needing the complication of automatically
      scaling the value depending on the runtime memory size, etc.  After
      this patch, the default value of DEBUG_KMEMLEAK_MEM_POOL_SIZE becomes
      sufficient again.
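
      Sketch of the approach (simplified from arm64's early_pgtable_alloc();
      the clearing of the new page is elided, and the flag name reflects the
      "skip kmemleak" memblock mechanism this change relies on):

          static phys_addr_t __init early_pgtable_alloc(int shift)
          {
                  phys_addr_t phys;

                  /* Sketch: don't register early page-table pages with
                   * kmemleak, like fed84c78 did for kasan_init(). */
                  phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
                                                   MEMBLOCK_ALLOC_NOLEAKTRACE);
                  if (!phys)
                          panic("Failed to allocate page table page\n");

                  return phys;
          }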
      
Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: https://lore.kernel.org/r/20211105150509.7826-1-quic_qiancai@quicinc.com
      Signed-off-by: Will Deacon <will@kernel.org>