1. 21 Oct, 2018 8 commits
  2. 30 Sep, 2018 1 commit
    • Matthew Wilcox's avatar
      xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox authored
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      Signed-off-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
  3. 12 Sep, 2018 1 commit
  4. 30 Jul, 2018 1 commit
  5. 29 Jul, 2018 1 commit
  6. 23 Jul, 2018 1 commit
    • Dan Williams's avatar
      filesystem-dax: Introduce dax_lock_mapping_entry() · c2a7d2a1
      Dan Williams authored
      In preparation for implementing support for memory poison (media error)
      handling via dax mappings, implement a lock_page() equivalent. Poison
      error handling requires rmap and needs guarantees that the page->mapping
      association is maintained / valid (inode not freed) for the duration of
      the lookup.
      In the device-dax case it is sufficient to simply hold a dev_pagemap
      reference. In the filesystem-dax case we need to use the entry lock.
      Export the entry lock via dax_lock_mapping_entry() that uses
      rcu_read_lock() to protect against the inode being freed, and
      revalidates the page->mapping association under xa_lock().
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarDave Jiang <dave.jiang@intel.com>
  7. 20 Jul, 2018 1 commit
    • Dan Williams's avatar
      filesystem-dax: Set page->index · 73449daf
      Dan Williams authored
      In support of enabling memory_failure() handling for filesystem-dax
      mappings, set ->index to the pgoff of the page. The rmap implementation
      requires ->index to bound the search through the vma interval tree. The
      index is set and cleared at dax_associate_entry() and
      dax_disassociate_entry() time respectively.
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarDave Jiang <dave.jiang@intel.com>
  8. 08 Jun, 2018 1 commit
  9. 03 Jun, 2018 1 commit
  10. 23 May, 2018 2 commits
    • Dan Williams's avatar
      dax: Report bytes remaining in dax_iomap_actor() · a77d4786
      Dan Williams authored
      In preparation for protecting the dax read(2) path from media errors
      with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the
      implementation to report the bytes successfully transferred.
      Cc: <x86@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Dan Williams's avatar
      dax: Introduce a ->copy_to_iter dax operation · b3a9a0c3
      Dan Williams authored
      Similar to the ->copy_from_iter() operation, a platform may want to
      deploy an architecture or device specific routine for handling reads
      from a dax_device like /dev/pmemX. On x86 this routine will point to a
      machine check safe version of copy_to_iter(). For now, add the plumbing
      to device-mapper and the dax core.
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  11. 22 May, 2018 1 commit
    • Dan Williams's avatar
      mm, fs, dax: handle layout changes to pinned dax mappings · 5fac7408
      Dan Williams authored
      get_user_pages() in the filesystem pins file backed memory pages for
      access by devices performing dma. However, it only pins the memory pages
      not the page-to-file offset association. If a file is truncated the
      pages are mapped out of the file and dma may continue indefinitely into
      a page that is owned by a device driver. This breaks coherency of the
      file vs dma, but the assumption is that if userspace wants the
      file-space truncated it does not matter what data is inbound from the
      device, it is not relevant anymore. The only expectation is that dma can
      safely continue while the filesystem reallocates the block(s).
      This expectation that dma can safely continue while the filesystem
      changes the block map is broken by dax. With dax the target dma page
      *is* the filesystem block. The model of leaving the page pinned for dma,
      but truncating the file block out of the file, means that the filesytem
      is free to reallocate a block under active dma to another file and now
      the expected data-incoherency situation has turned into active
      Defer all filesystem operations (fallocate(), truncate()) on a dax mode
      file while any page/block in the file is under active dma. This solution
      assumes that dma is transient. Cases where dma operations are known to
      not be transient, like RDMA, have been explicitly disabled via
      commits like 5f1d43de "IB/core: disable memory registration of
      filesystem-dax vmas".
      The dax_layout_busy_page() routine is called by filesystems with a lock
      held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
      The process of looking up a busy page invalidates all mappings
      to trigger any subsequent get_user_pages() to block on i_mmap_lock.
      The filesystem continues to call dax_layout_busy_page() until it finally
      returns no more active pages. This approach assumes that the page
      pinning is transient, if that assumption is violated the system would
      have likely hung from the uncompleted I/O.
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reported-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  12. 16 Apr, 2018 1 commit
  13. 11 Apr, 2018 1 commit
  14. 03 Apr, 2018 1 commit
    • Dan Williams's avatar
      fs, dax: use page->mapping to warn if truncate collides with a busy page · d2c997c0
      Dan Williams authored
      Catch cases where extent unmap operations encounter pages that are
      pinned / busy. Typically this is pinned pages that are under active dma.
      This warning is a canary for potential data corruption as truncated
      blocks could be allocated to a new file while the device is still
      performing i/o.
      Here is an example of a collision that this implementation catches:
       WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
       Call Trace:
        ? tlb_gather_mmu+0x10/0x20
        ? up_write+0x1c/0x40
        ? unmap_mapping_range+0x73/0x140
        xfs_free_file_space+0x1b6/0x5b0 [xfs]
        ? xfs_file_fallocate+0x7f/0x320 [xfs]
        ? down_write_nested+0x40/0x70
        ? xfs_ilock+0x21d/0x2f0 [xfs]
        xfs_file_fallocate+0x162/0x320 [xfs]
        ? rcu_read_lock_sched_held+0x3f/0x70
        ? rcu_sync_lockdep_assert+0x2a/0x50
        ? __sb_start_write+0xd0/0x1b0
        ? vfs_fallocate+0x20c/0x270
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  15. 30 Mar, 2018 1 commit
  16. 01 Feb, 2018 2 commits
  17. 07 Jan, 2018 1 commit
  18. 16 Dec, 2017 1 commit
    • Linus Torvalds's avatar
      Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" · f6f37321
      Linus Torvalds authored
      This reverts commits 5c9d2d5c, c7da82b8, and e7fe7b5c.
      We'll probably need to revisit this, but basically we should not
      complicate the get_user_pages_fast() case, and checking the actual page
      table protection key bits will require more care anyway, since the
      protection keys depend on the exact state of the VM in question.
      Particularly when doing a "remote" page lookup (ie in somebody elses VM,
      not your own), you need to be much more careful than this was.  Dave
      Hansen says:
       "So, the underlying bug here is that we now a get_user_pages_remote()
        and then go ahead and do the p*_access_permitted() checks against the
        current PKRU. This was introduced recently with the addition of the
        new p??_access_permitted() calls.
        We have checks in the VMA path for the "remote" gups and we avoid
        consulting PKRU for them. This got missed in the pkeys selftests
        because I did a ptrace read, but not a *write*. I also didn't
        explicitly test it against something where a COW needed to be done"
      It's also not entirely clear that it makes sense to check the protection
      key bits at this level at all.  But one possible eventual solution is to
      make the get_user_pages_fast() case just abort if it sees protection key
      bits set, which makes us fall back to the regular get_user_pages() case,
      which then has a vma and can do the check there if we want to.
      We'll see.
      Somewhat related to this all: what we _do_ want to do some day is to
      check the PAGE_USER bit - it should obviously always be set for user
      pages, but it would be a good check to have back.  Because we have no
      generic way to test for it, we lost it as part of moving over from the
      architecture-specific x86 GUP implementation to the generic one in
      commit e585513b ("x86/mm/gup: Switch GUP to the generic
      get_user_page_fast() implementation").
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  19. 30 Nov, 2017 1 commit
  20. 16 Nov, 2017 3 commits
    • Mel Gorman's avatar
      mm, pagevec: remove cold parameter for pagevecs · 86679820
      Mel Gorman authored
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Mel Gorman authored
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Jérôme Glisse's avatar
      mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse authored
      This patch only affects users of mmu_notifier->invalidate_range callback
      which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
      and it is an optimization for those users.  Everyone else is unaffected
      by it.
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (notify version of *_clear_flush helpers do call the
      mmu_notifier_invalidate_range).  But that notification is not necessary
      in all cases.
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      Below is a more in depth analysis of why this is fine to do this:
      For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
      device use thing like ATS/PASID to get the IOMMU to walk the CPU page
      table to access a process virtual address space).  There is only 2 cases
      when you need to notify those secondary TLB while holding page table
      lock when clearing a pte/pmd:
        A) page backing address is free before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      Case A is obvious you do not want to take the risk for the device to write
      to a page that might now be used by something completely different.
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      If clearing the page table entry is not followed by a notify before setting
      the new pte/pmd value then you can break memory model like C11 or C++11 for
      the device.
      Consider the following scenario (device use a feature similar to ATS/
      Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
      assume they are write protected for COW (other case of B apply too).
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      So here because at time N+2 the clear page table entry was not pair with a
      notification to invalidate the secondary TLB, the device see the new value
      for addrB before seing the new value for addrA.  This break total memory
      ordering for the device.
      When changing a pte to write protect or to point to a new write protected
      page with same content (KSM) it is ok to delay invalidate_range callback
      to mmu_notifier_invalidate_range_end() outside the page table lock.  This
      is true even if the thread doing page table update is preempted right
      after releasing page table lock before calling
      Thanks to Andrea for thinking of a problematic scenario for COW.
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  21. 15 Nov, 2017 1 commit
    • Jeff Moyer's avatar
      dax: fix PMD faults on zero-length files · 957ac8c4
      Jeff Moyer authored
      PMD faults on a zero length file on a file system mounted with -o dax
      will not generate SIGBUS as expected.
      	fd = open(...O_TRUNC);
      	addr = mmap(NULL, 2*1024*1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	*addr = 'a';
              <expect SIGBUS>
      The problem is this code in dax_iomap_pmd_fault:
      	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
      If the inode size is zero, we end up with a max_pgoff that is way larger
      than 0.  :)  Fix it by using DIV_ROUND_UP, as is done elsewhere in the
      I tested this with some simple test code that ensured that SIGBUS was
      received where expected.
      Cc: <stable@vger.kernel.org>
      Fixes: 642261ac ("dax: add struct iomap based DAX PMD support")
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  22. 14 Nov, 2017 1 commit
  23. 03 Nov, 2017 7 commits