1. 12 Apr, 2018 1 commit
  2. 31 Mar, 2018 1 commit
  3. 26 Mar, 2018 3 commits
  4. 29 Jan, 2018 1 commit
  5. 24 Jan, 2018 1 commit
    • Josef Bacik's avatar
      Btrfs: fix stale entries in readdir · e4fd493c
      Josef Bacik authored
      In fixing the readdir+pagefault deadlock I accidentally introduced a
      stale entry regression in readdir.  If we get close to full for the
      temporary buffer, and then skip a few delayed deletions, and then try to
      add another entry that won't fit, we will emit the entries we found and
      retry.  Unfortunately we delete entries from our del_list as we find
      them, assuming we won't need them.  However our pos will be with
      whatever our last entry was, which could be before the delayed deletions
      we skipped, so the next search will add the deleted entries back into
      our readdir buffer.  So instead don't delete entries we find in our
      del_list so we can make sure we always find our delayed deletions.  This
      is a slight perf hit for readdir with lots of pending deletions, but
      hopefully this isn't a common occurrence.  If it is we can revist this
      and optimize it.
      
      cc: stable@vger.kernel.org
      Fixes: 23b5ec74 ("btrfs: fix readdir deadlock with pagefault")
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e4fd493c
  6. 22 Jan, 2018 2 commits
  7. 02 Jan, 2018 1 commit
    • Chris Mason's avatar
      btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes · ec35e48b
      Chris Mason authored
      refcounts have a generic implementation and an asm optimized one.  The
      generic version has extra debugging to make sure that once a refcount
      goes to zero, refcount_inc won't increase it.
      
      The btrfs delayed inode code wasn't expecting this, and we're tripping
      over the warnings when the generic refcounts are used.  We ended up with
      this race:
      
      Process A                                         Process B
                                                        btrfs_get_delayed_node()
      						  spin_lock(root->inode_lock)
      						  radix_tree_lookup()
      __btrfs_release_delayed_node()
      refcount_dec_and_test(&delayed_node->refs)
      our refcount is now zero
      						  refcount_add(2) <---
      						  warning here, refcount
                                                        unchanged
      
      spin_lock(root->inode_lock)
      radix_tree_delete()
      
      With the generic refcounts, we actually warn again when process B above
      tries to release his refcount because refcount_add() turned into a
      no-op.
      
      We saw this in production on older kernels without the asm optimized
      refcounts.
      
      The fix used here is to use refcount_inc_not_zero() to detect when the
      object is in the middle of being freed and return NULL.  This is almost
      always the right answer anyway, since we usually end up pitching the
      delayed_node if it didn't have fresh data in it.
      
      This also changes __btrfs_release_delayed_node() to remove the extra
      check for zero refcounts before radix tree deletion.
      btrfs_get_delayed_node() was the only path that was allowing refcounts
      to go from zero to one.
      
      Fixes: 6de5f18e ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
      CC: <stable@vger.kernel.org> # 4.12+
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ec35e48b
  8. 01 Nov, 2017 1 commit
    • Josef Bacik's avatar
      btrfs: make the delalloc block rsv per inode · 69fe2d75
      Josef Bacik authored
      The way we handle delalloc metadata reservations has gotten
      progressively more complicated over the years.  There is so much cruft
      and weirdness around keeping the reserved count and outstanding counters
      consistent and handling the error cases that it's impossible to
      understand.
      
      Fix this by making the delalloc block rsv per-inode.  This way we can
      calculate the actual size of the outstanding metadata reservations every
      time we make a change, and then reserve the delta based on that amount.
      This greatly simplifies the code everywhere, and makes the error
      handling in btrfs_delalloc_reserve_metadata far less terrifying.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      69fe2d75
  9. 18 Aug, 2017 1 commit
  10. 18 Apr, 2017 2 commits
  11. 28 Feb, 2017 1 commit
  12. 14 Feb, 2017 14 commits
  13. 13 Dec, 2016 1 commit
    • Maxim Patlasov's avatar
      btrfs: limit async_work allocation and worker func duration · 2939e1a8
      Maxim Patlasov authored
      Problem statement: unprivileged user who has read-write access to more than
      one btrfs subvolume may easily consume all kernel memory (eventually
      triggering oom-killer).
      
      Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):
      
      [root@kteam1 ~]# cat prep.sh
      
      DEV=/dev/sdb
      mkfs.btrfs -f $DEV
      mount $DEV /mnt
      for i in `seq 1 16`
      do
      	mkdir /mnt/$i
      	btrfs subvolume create /mnt/SV_$i
      	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
      	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
      	chmod a+rwx /mnt/$i
      done
      
      [root@kteam1 ~]# sh prep.sh
      
      [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done
      
      [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
      kmalloc-128        10144  10144    128   32    1 : tunables    0    0    0 : slabdata    317    317      0
      kmalloc-128       9992352 9992352    128   32    1 : tunables    0    0    0 : slabdata 312261 312261      0
      kmalloc-128       24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata 757086 757086      0
      kmalloc-128       42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0
      
      The huge numbers above come from insane number of async_work-s allocated
      and queued by btrfs_wq_run_delayed_node.
      
      The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
      works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
      worker func (btrfs_async_run_delayed_root) processes at least
      BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery
      works as expected while the list is almost empty. As soon as it is getting
      bigger, worker func starts to process more than one item at a time, it takes
      longer, and the chances to have async_works queued more than needed is getting
      higher.
      
      The problem above is worsened by another flaw of delayed-inode implementation:
      if async_work was queued in a throttling branch (number of items >=
      BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until
      the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that
      the func occupies CPU infinitely (up to 30sec in my experiments): while the
      func is trying to drain the list, the user activity may add more and more
      items to the list.
      
      The patch fixes both problems in straightforward way: refuse queuing too
      many works in btrfs_wq_run_delayed_node and bail out of worker func if
      at least BTRFS_DELAYED_WRITEBACK items are processed.
      
      Changed in v2: remove support of thresh == NO_THRESHOLD.
      Signed-off-by: default avatarMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v3.15+
      2939e1a8
  14. 06 Dec, 2016 5 commits
  15. 30 Nov, 2016 1 commit
    • Jeff Mahoney's avatar
      btrfs: increment ctx->pos for every emitted or skipped dirent in readdir · d2fbb2b5
      Jeff Mahoney authored
      If we process the last item in the leaf and hit an I/O error while
      reading the next leaf, we return -EIO without having adjusted the
      position.  Since we have emitted dirents, getdents() will return
      the byte count to the user instead of the error.  Subsequent callers
      will emit the last successful dirent again, and return -EIO again,
      with the same result.  Callers loop forever.
      
      Instead, if we always increment ctx->pos after emitting or skipping
      the dirent, we'll be sure that we won't hit the same one again.  When
      we go to process the next leaf, we won't have emitted any dirents
      and the -EIO will be returned to the user properly.  We also don't
      need to track if we've emitted a dirent already or if we've changed
      the position yet.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2fbb2b5
  16. 26 Sep, 2016 3 commits
  17. 26 Jul, 2016 1 commit