1. 06 Dec, 2018 1 commit
    • Matthew Wilcox's avatar
      radix tree: Don't return retry entries from lookup · eff3860b
      Matthew Wilcox authored
      Commit 66ee620f ("idr: Permit any valid kernel pointer to be stored")
      changed the radix tree lookup so that it stops when reaching the bottom
      of the tree.  However, the condition was added in the wrong place,
      making it possible to return retry entries to the caller.  Reorder the
      tests to check for the retry entry before checking whether we're at the
      bottom of the tree.  The retry entry should never be found in the tree
      root, so it's safe to defer the check until the end of the loop.
      Add a regression test to the test-suite to be sure this doesn't come
      Fixes: 66ee620f ("idr: Permit any valid kernel pointer to be stored")
      Reported-by: default avatarGreg Kurz <groug@kaod.org>
      Signed-off-by: default avatarMatthew Wilcox <willy@infradead.org>
  2. 21 Oct, 2018 23 commits
  3. 30 Sep, 2018 3 commits
    • Matthew Wilcox's avatar
      xarray: Change definition of sibling entries · 02c02bf1
      Matthew Wilcox authored
      Instead of storing a pointer to the slot containing the canonical entry,
      store the offset of the slot.  Produces slightly more efficient code
      (~300 bytes) and simplifies the implementation.
      Signed-off-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
    • Matthew Wilcox's avatar
      xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox authored
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      Signed-off-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
    • Matthew Wilcox's avatar
      idr: Permit any valid kernel pointer to be stored · 66ee620f
      Matthew Wilcox authored
      An upcoming change to the encoding of internal entries will set the bottom
      two bits to 0b10.  Unfortunately, m68k only aligns some data structures
      to 2 bytes, so the IDR will interpret them as internal entries and things
      will go badly wrong.
      Change the radix tree so that it stops either when the node indicates
      that it's the bottom of the tree (shift == 0) or when the entry is not an
      internal entry.  This means we cannot insert an arbitrary kernel pointer
      as a multiorder entry, but the IDR does not permit multiorder entries.
      Annoyingly, this means the IDR can no longer take advantage of the radix
      tree's ability to store a single entry at offset 0 without allocating
      memory.  A pointer which is 2-byte aligned cannot be stored directly in
      the root as it would be indistinguishable from a node, so we must allocate
      a node in order to store a 2-byte pointer at index 0.  The idr_replace()
      function does not take a GFP flags argument, so cannot allocate memory.
      If a user inserts a 4-byte aligned pointer at index 0 and then replaces
      it with a 2-byte aligned pointer, we must be able to store it.
      Arbitrary pointer values are still not permitted; pointers of the
      form 2 + (i * 4) for values of i between 0 and 1023 are reserved for
      the implementation.  These are not valid kernel pointers as they would
      point into the zero page.
      This change does cause a runtime memory consumption regression for
      the IDA.  I will recover that later.
      Signed-off-by: default avatarMatthew Wilcox <willy@infradead.org>
      Tested-by: default avatarGuenter Roeck <linux@roeck-us.net>
  4. 22 Aug, 2018 8 commits
  5. 26 May, 2018 1 commit
    • Matthew Wilcox's avatar
      idr: fix invalid ptr dereference on item delete · 7a4deea1
      Matthew Wilcox authored
      If the radix tree underlying the IDR happens to be full and we attempt
      to remove an id which is larger than any id in the IDR, we will call
      __radix_tree_delete() with an uninitialised 'slot' pointer, at which
      point anything could happen.  This was easiest to hit with a single
      entry at id 0 and attempting to remove a non-0 id, but it could have
      happened with 64 entries and attempting to remove an id >= 64.
      Roman said:
        The syzcaller test boils down to opening /dev/kvm, creating an
        eventfd, and calling a couple of KVM ioctls. None of this requires
        superuser. And the result is dereferencing an uninitialized pointer
        which is likely a crash. The specific path caught by syzbot is via
        KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are
        other user-triggerable paths, so cc:stable is probably justified.
      Matthew added:
        We have around 250 calls to idr_remove() in the kernel today. Many of
        them pass an ID which is embedded in the object they're removing, so
        they're safe. Picking a few likely candidates:
        drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl.
        drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar
        drivers/atm/nicstar.c could be taken down by a handcrafted packet
      Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org
      Fixes: 0a835c4f ("Reimplement IDR and IDA using the radix tree")
      Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com>
      Debugged-by: default avatarRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 19 May, 2018 3 commits
    • Ross Zwisler's avatar
      radix tree test suite: multi-order iteration race · fd8f58c4
      Ross Zwisler authored
      Add a test which shows a race in the multi-order iteration code.  This
      test reliably hits the race in under a second on my machine, and is the
      result of a real bug report against kernel a production v4.15 based
      kernel (4.15.6-300.fc27.x86_64).  With a real kernel this issue is hit
      when using order 9 PMD DAX radix tree entries.
      The race has to do with how we tear down multi-order sibling entries
      when we are removing an item from the tree.  Remember that an order 2
      entry looks like this:
        struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
      where 'entry' is in some slot in the struct radix_tree_node, and the
      three slots following 'entry' contain sibling pointers which point back
      to 'entry.'
      When we delete 'entry' from the tree, we call :
      replace_slot() first removes the siblings in order from the first to the
      last, then at then replaces 'entry' with NULL.  This means that for a
      brief period of time we end up with one or more of the siblings removed,
        struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
      This causes an issue if you have a reader iterating over the slots in
      the tree via radix_tree_for_each_slot() while only under
      rcu_read_lock()/rcu_read_unlock() protection.  This is a common case in
      The issue is that when __radix_tree_next_slot() => skip_siblings() tries
      to skip over the sibling entries in the slots, it currently does so with
      an exact match on the slot directly preceding our current slot.
      Normally this works:
                                            V preceding slot
        struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
                                                    ^ current slot
      This lets you find the first sibling, and you skip them all in order.
      But in the case where one of the siblings is NULL, that slot is skipped
      and then our sibling detection is interrupted:
                                                   V preceding slot
        struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
                                                          ^ current slot
      This means that the sibling pointers aren't recognized since they point
      all the way back to 'entry', so we think that they are normal internal
      radix tree pointers.  This causes us to think we need to walk down to a
      struct radix_tree_node starting at the address of 'entry'.
      In a real running kernel this will crash the thread with a GP fault when
      you try and dereference the slots in your broken node starting at
      In the radix tree test suite this will be caught by the address
        ==27063==ERROR: AddressSanitizer: heap-buffer-overflow on address
        0x60c0008ae400 at pc 0x00000040ce4f bp 0x7fa89b8fcad0 sp 0x7fa89b8fcac0
        READ of size 8 at 0x60c0008ae400 thread T3
            #0 0x40ce4e in __radix_tree_next_slot /home/rzwisler/project/linux/tools/testing/radix-tree/radix-tree.c:1660
            #1 0x4022cc in radix_tree_next_slot linux/../../../../include/linux/radix-tree.h:567
            #2 0x4022cc in iterator_func /home/rzwisler/project/linux/tools/testing/radix-tree/multiorder.c:655
            #3 0x7fa8a088d50a in start_thread (/lib64/libpthread.so.0+0x750a)
            #4 0x7fa8a03bd16e in clone (/lib64/libc.so.6+0xf516e)
      Link: http://lkml.kernel.org/r/20180503192430.7582-5-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Ross Zwisler's avatar
      radix tree test suite: add item_delete_rcu() · 3e252fa7
      Ross Zwisler authored
      Currently the lifetime of "struct item" entries in the radix tree are
      not controlled by RCU, but are instead deleted inline as they are
      removed from the tree.
      In the following patches we add a test which has threads iterating over
      items pulled from the tree and verifying them in an
      rcu_read_lock()/rcu_read_unlock() section.  This means that though an
      item has been removed from the tree it could still be being worked on by
      other threads until the RCU grace period expires.  So, we need to
      actually free the "struct item" structures at the end of the grace
      period, just as we do with "struct radix_tree_node" items.
      Link: http://lkml.kernel.org/r/20180503192430.7582-4-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Ross Zwisler's avatar
      radix tree test suite: fix mapshift build target · 8d9fa88e
      Ross Zwisler authored
      Commit c6ce3e2f ("radix tree test suite: Add config option for map
      shift") introduced a phony makefile target called 'mapshift' that ends
      up generating the file generated/map-shift.h.  This phony target was
      then added as a dependency of the top level 'targets' build target,
      which is what is run when you go to tools/testing/radix-tree and just
      type 'make'.
      Unfortunately, this phony target doesn't actually work as a dependency,
      so you end up getting:
        $ make
        make: *** No rule to make target 'generated/map-shift.h', needed by 'main.o'.  Stop.
        make: *** Waiting for unfinished jobs....
      Fix this by making the file generated/map-shift.h our real makefile
      target, and add this a dependency of the top level build target.
      Link: http://lkml.kernel.org/r/20180503192430.7582-2-ross.zwisler@linux.intel.comSigned-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 11 Apr, 2018 1 commit
    • Matthew Wilcox's avatar
      radix tree: use GFP_ZONEMASK bits of gfp_t for flags · fa290cda
      Matthew Wilcox authored
      Patch series "XArray", v9.  (First part thereof).
      This patchset is, I believe, appropriate for merging for 4.17.  It
      contains the XArray implementation, to eventually replace the radix
      tree, and converts the page cache to use it.
      This conversion keeps the radix tree and XArray data structures in sync
      at all times.  That allows us to convert the page cache one function at
      a time and should allow for easier bisection.  Other than renaming some
      elements of the structures, the data structures are fundamentally
      unchanged; a radix tree walk and an XArray walk will touch the same
      number of cachelines.  I have changes planned to the XArray data
      structure, but those will happen in future patches.
      Improvements the XArray has over the radix tree:
       - The radix tree provides operations like other trees do; 'insert' and
         'delete'. But what most users really want is an automatically
         resizing array, and so it makes more sense to give users an API that
         is like an array -- 'load' and 'store'. We still have an 'insert'
         operation for users that really want that semantic.
       - The XArray considers locking as part of its API. This simplifies a
         lot of users who formerly had to manage their own locking just for
         the radix tree. It also improves code generation as we can now tell
         RCU that we're holding a lock and it doesn't need to generate as much
         fencing code. The other advantage is that tree nodes can be moved
         (not yet implemented).
       - GFP flags are now parameters to calls which may need to allocate
         memory. The radix tree forced users to decide what the allocation
         flags would be at creation time. It's much clearer to specify them at
         allocation time.
       - Memory is not preloaded; we don't tie up dozens of pages on the off
         chance that the slab allocator fails. Instead, we drop the lock,
         allocate a new node and retry the operation. We have to convert all
         the radix tree, IDA and IDR preload users before we can realise this
         benefit, but I have not yet found a user which cannot be converted.
       - The XArray provides a cmpxchg operation. The radix tree forces users
         to roll their own (and at least four have).
       - Iterators take a 'max' parameter. That simplifies many users and will
         reduce the amount of iteration done.
       - Iteration can proceed backwards. We only have one user for this, but
         since it's called as part of the pagefault readahead algorithm, that
         seemed worth mentioning.
       - RCU-protected pointers are not exposed as part of the API. There are
         some fun bugs where the page cache forgets to use rcu_dereference()
         in the current codebase.
       - Value entries gain an extra bit compared to radix tree exceptional
         entries. That gives us the extra bit we need to put huge page swap
         entries in the page cache.
       - Some iterators now take a 'filter' argument instead of having
         separate iterators for tagged/untagged iterations.
      The page cache is improved by this:
       - Shorter, easier to read code
       - More efficient iterations
       - Reduction in size of struct address_space
       - Fewer walks from the top of the data structure; the XArray API
         encourages staying at the leaf node and conducting operations there.
      This patch (of 8):
      None of these bits may be used for slab allocations, so we can use them
      as radix tree flags as long as we mask them off before passing them to
      the slab allocator. Move the IDR flag from the high bits to the
      GFP_ZONEMASK bits.
      Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: default avatarJeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>