1. 01 Nov, 2019 6 commits
  2. 30 Sep, 2019 5 commits
  3. 26 Sep, 2019 8 commits
    • powerpc/eeh: Fix eeh_debugfs_break_device() with SRIOV devices · 253c8921
      Oliver O'Halloran authored
      s/CONFIG_IOV/CONFIG_PCI_IOV/
      
      Whoops.
      
      Fixes: bd6461cc ("powerpc/eeh: Add a eeh_dev_break debugfs interface")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      [mpe: Fixup the #endif comment as well]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190926122502.14826-1-oohall@gmail.com
    • arch/sparc/include/asm/pgtable_64.h: fix build · a22fea94
      Andrew Morton authored
      A last-minute fixlet which I'd failed to merge at the appropriate time
      had the predictable effect.
      
      Fixes: f672e2c2 ("lib: untag user pointers in strn*_user")
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: treewide: clarify pgtable_page_{ctor,dtor}() naming · b4ed71f5
      Mark Rutland authored
      The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
      people, and until recently arm64 used these erroneously/pointlessly for
      other levels of page table.
      
      To make it incredibly clear that these only apply to the PTE level, and to
      align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
      to pgtable_pte_page_{ctor,dtor}().
      
      These changes were generated with the following shell script:
      
      ----
      git grep -lw 'pgtable_page_.tor' | while read FILE; do
          sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
          sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
      done
      ----
      
      ... with the documentation re-flowed to remain under 80 columns, and
      whitespace fixed up in macros to keep backslashes aligned.
      
      There should be no functional change as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hexagon: drop empty and unused free_initrd_mem · c7cc8d77
      Mike Rapoport authored
      hexagon never reserves or initializes initrd and the only mention of it is
      the empty free_initrd_mem() function.
      
      As we have a generic implementation of free_initrd_mem(), there is no need
      to define an empty stub for the hexagon implementation and it can be
      dropped.
      
      Link: http://lkml.kernel.org/r/1565858133-25852-1-git-send-email-rppt@linux.ibm.com
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce MADV_PAGEOUT · 1a4e58cc
      Minchan Kim authored
      When a process expects no accesses to a certain memory range for a long
      time, it can hint the kernel that the pages can be reclaimed instantly
      but that their data should be preserved for future use.  This reduces
      workingset eviction and so ends up improving performance.
      
      This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
      syscall.  MADV_PAGEOUT can be used by a process to mark a memory range
      as not expected to be used for a long time so that the kernel reclaims
      *any LRU* pages instantly.  The hint can help the kernel decide which
      pages to evict proactively.
      
      A note: it intentionally doesn't apply the SWAP_CLUSTER_MAX LRU page
      isolation limit because reclaim is automatically bounded by the PMD
      size.  If the PMD size (e.g., 256 pages) causes trouble, we could fix
      it later by limiting it to SWAP_CLUSTER_MAX [1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future, so pages in the specified
      regions can be reclaimed instantly regardless of memory pressure.
      Accessing the range after a successful operation may therefore cause a
      major page fault, but the up-to-date contents are never lost, unlike
      with MADV_DONTNEED.  Pages belonging to a shared mapping are only
      processed if write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce MADV_COLD · 9c276cc6
      Minchan Kim authored
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as
      well as faster, so we try to make hot starts more likely than cold
      starts.
      
      To make hot starts more likely, Android userspace manages the order in
      which apps should be killed via a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd (the low memory killer daemon).  Apps are likely to be
      killed by lmkd if the system has to reclaim memory.  In that sense they
      are similar to entries in any other cache.  Those apps are kept alive
      for opportunistic performance improvements, but those improvements vary
      based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap, even though they
      are good candidates for swap.  Upon investigation, we found that
      swapping out only begins once the low zone watermark is hit and kswapd
      wakes up, but the overall allocation rate in the system might trip lmkd
      thresholds first and cause a cached process to be killed (we measured
      the performance of swapping out vs. zapping the memory by killing a
      process: unsurprisingly, zapping is 10x faster even though we use zram,
      which is much faster than real storage).  The kills from lmkd thus
      often satisfy the high zone watermark, resulting in very few pages
      actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd
      by reclaiming apps as soon as they enter the cached state.
      Additionally, it gives the platform many opportunities to use its
      richer information to optimize memory efficiency.
      
      To achieve this goal, the patchset introduces two new options for
      madvise.  One is MADV_COLD, which deactivates active pages, and the
      other is MADV_PAGEOUT, which reclaims private pages instantly.  These
      new options complement MADV_DONTNEED and MADV_FREE by adding
      non-destructive ways to gain some free memory.  MADV_PAGEOUT is
      similar to MADV_DONTNEED in that it hints the kernel that the memory
      region is not currently needed and should be reclaimed immediately;
      MADV_COLD is similar to MADV_FREE in that it hints the kernel that the
      memory region is not currently needed and should be reclaimed when
      memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it can
      hint the kernel that the pages can be reclaimed when memory pressure
      occurs but that the data should be preserved for future use.  This
      reduces workingset eviction and so ends up improving performance.
      
      This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not
      expected to be used in the near future.  The hint can help the kernel
      decide which pages to evict early during memory pressure.
      
      It works on every LRU page, like MADV_[DONTNEED|FREE].  In other words,
      it moves
      
      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the head
      of the inactive file LRU, because MADV_COLD has slightly different
      semantics.  MADV_FREE means it's okay to discard the page under memory
      pressure because its contents are *garbage*, so freeing such pages has
      almost zero overhead: there is no need to swap them out, and a later
      access causes only a minor fault.  Thus it makes sense to put those
      freeable pages on the inactive file LRU to compete with other
      used-once pages.  It also makes sense from an implementation point of
      view, because such a page is no longer swap-backed until it is
      re-dirtied, and it even has the bonus of allowing these pages to be
      reclaimed on a swapless system.  However, MADV_COLD doesn't mean
      garbage: reclaiming those pages ultimately requires swap-out/swap-in,
      so the cost is higher.  Since the VM's LRU aging is designed around a
      cost model, anonymous cold pages are better placed on the inactive
      anon LRU list, not the file LRU.  Furthermore, this helps avoid
      unnecessary scanning on systems without a swap device.  Let's start
      with this simpler approach without adding complexity at this point.
      Keep in mind the caveat, though, that workloads with a lot of page
      cache are likely to effectively ignore MADV_COLD on anonymous memory,
      because anonymous LRU lists are rarely aged.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib: untag user pointers in strn*_user · 903f433f
      Andrey Konovalov authored
      Patch series "arm64: untag user pointers passed to the kernel", v19.
      
      === Overview
      
      arm64 has a feature called Top Byte Ignore, which allows embedding
      pointer tags in the top byte of each pointer.  Userspace programs (such
      as HWASan, a memory debugging tool [1]) might use this feature and pass
      tagged user pointers to the kernel through syscalls or other
      interfaces.
      
      Right now the kernel is already able to handle user faults with tagged
      pointers, due to these patches:
      
      1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation
         on a tagged pointer")
      2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
         pointers")
      3. 276e9327 ("arm64: entry: improve data abort handling of tagged
         pointers")
      
      This patchset extends tagged pointer support to syscall arguments.
      
      As per the proposed ABI change [3], tagged pointers are only allowed to be
      passed to syscalls when they point to memory ranges obtained by anonymous
      mmap() or sbrk() (see the patchset [3] for more details).
      
      For non-memory syscalls this is done by untagging user pointers when
      the kernel performs pointer checking to find out whether the pointer
      comes from userspace (most notably in access_ok).  The untagging is
      done only when the pointer is being checked; the tag is preserved as
      the pointer makes its way through the kernel and stays tagged when the
      kernel dereferences the pointer while performing user memory accesses.
      
      The mmap and mremap (only new_addr) syscalls do not currently accept
      tagged addresses.  Architectures may interpret the tag as a background
      colour for the corresponding vma.
      
      Other memory syscalls (mprotect, etc.) don't do user memory accesses but
      rather deal with memory ranges, and untagged pointers are better suited to
      describe memory ranges internally.  Thus for memory syscalls we untag
      pointers completely when they enter the kernel.
      
      === Other approaches
      
      One alternative approach to untagging that was considered is to
      completely strip the pointer tag as the pointer enters the kernel with
      some kind of syscall wrapper, but that won't work with the countless
      different ioctl calls.  With this approach we would need a custom
      wrapper for each ioctl variation, which doesn't seem practical.
      
      An alternative to untagging pointers in memory syscall prologues is to
      instead allow tagged pointers to be passed to find_vma() (and other
      vma-related functions) and untag them there.  Unfortunately, a lot of
      find_vma() callers then compare or subtract the returned vma start and
      end fields against the pointer that was being searched for.  Thus this
      approach would still require changing all find_vma() callers.
      
      === Testing
      
      The following testing approaches have been taken to find potential
      issues with user pointer untagging:
      
      1. Static testing (with sparse [2] and separately with a custom static
         analyzer based on Clang) to track casts of __user pointers to integer
         types to find places where untagging needs to be done.
      
      2. Static testing with grep to find parts of the kernel that call
         find_vma() (and other similar functions) or directly compare against
         vm_start/vm_end fields of vma.
      
      3. Static testing with grep to find parts of the kernel that compare
         user pointers with TASK_SIZE or other similar consts and macros.
      
      4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
         a modified syzkaller version that passes tagged pointers to the kernel.
      
      Based on the results of the testing, the required patches have been
      added to the patchset.
      
      === Notes
      
      This patchset is meant to be merged together with "arm64 relaxed ABI" [3].
      
      This patchset is a prerequisite for ARM's memory tagging hardware feature
      support [4].
      
      This patchset has been merged into the Pixel 2 & 3 kernel trees and is
      now being used to enable testing of Pixel phones with HWASan.
      
      Thanks!
      
      [1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
      
      [2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292
      
      [3] https://lkml.org/lkml/2019/6/12/745
      
      [4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
      
      This patch (of 11):
      
      This patch is part of a series that extends the kernel ABI to allow
      passing tagged user pointers (with the top byte set to something other
      than 0x00) as syscall arguments.
      
      strncpy_from_user and strnlen_user accept user addresses as arguments, and
      do not go through the same path as copy_from_user and others, so here we
      need to handle the case of tagged user addresses separately.
      
      Untag user pointers passed to these functions.
      
      Note that this patch only temporarily untags the pointers to perform
      validity checks, but then uses them as-is to perform user memory
      accesses.
      
      [andreyknvl@google.com: fix sparc4 build]
       Link: http://lkml.kernel.org/r/CAAeHK+yx4a-P0sDrXTUxMvO2H0CJZUFPffBrg_cU7oJOZyC7ew@mail.gmail.com
      Link: http://lkml.kernel.org/r/c5a78bcad3e94d6cda71fcaa60a423231ae71e4c.1563904656.git.andreyknvl@google.com
      
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Eric Auger <eric.auger@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jens Wiklander <jens.wiklander@linaro.org>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro · 315cc066
      Michel Lespinasse authored
      Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
      for the case where the augmented value is a scalar whose definition
      follows a max(f(node)) pattern.  This actually covers all present uses of
      RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
      various RBCOMPUTE function definitions.
      
      [walken@google.com: fix mm/vmalloc.c]
        Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
      [walken@google.com: re-add check to check_augmented()]
        Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
      Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 25 Sep, 2019 7 commits
  5. 24 Sep, 2019 14 commits