  1. Nov 26, 2020
    • io_uring: fix files grab/cancel race · af604703
      Pavel Begunkov authored
      
      When one task is in io_uring_cancel_files() and another is doing
      io_prep_async_work(), a race may happen. That's because after accounting
      a request as inflight in the first call to io_grab_identity(), it may
      still fail and go to io_identity_cow(), which might briefly leave
      work.identity dangling, among other problems.
      
      Grab files last, so io_prep_async_work() won't fail if it did get into
      ->inflight_list.
      
      Note: the bug shouldn't exist once io_uring_cancel_files() no longer
      pokes into other tasks' requests.
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      af604703
  2. Nov 25, 2020
    • efivarfs: revert "fix memory leak in efivarfs_create()" · ff04f3b6
      Ard Biesheuvel authored
      
      The memory leak addressed by commit fe5186cf is a false positive:
      all allocations are recorded in a linked list, and freed when the
      filesystem is unmounted. This leads to double frees, and as reported
      by David, leads to crashes if SLUB is configured to self destruct when
      double frees occur.
      
      So drop the redundant kfree() again, and instead, mark the offending
      pointer variable so the allocation is ignored by kmemleak.
      
      Cc: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
      Fixes: fe5186cf ("efivarfs: fix memory leak in efivarfs_create()")
      Reported-by: David Laight <David.Laight@aculab.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      ff04f3b6
  3. Nov 24, 2020
    • io_uring: fix ITER_BVEC check · 9c3a205c
      Pavel Begunkov authored
      
      iov_iter::type is a bitmask that also keeps direction etc., so it
      shouldn't be directly compared against ITER_*. Use proper helper.
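
The difference between a direct comparison and a masking helper can be sketched in userspace C. This is a simplified model, not the kernel's definitions: the `MODEL_*` constants, `struct model_iter`, and both functions are names invented for this illustration. The only thing carried over from the text above is that the type field mixes a direction bit with the iterator kind.

```c
#include <stdbool.h>

/* Model constants (hypothetical values): bit 0 is the direction,
 * the higher bits select the iterator kind. */
enum { MODEL_READ = 0, MODEL_WRITE = 1 };
enum { MODEL_ITER_IOVEC = 4, MODEL_ITER_KVEC = 8, MODEL_ITER_BVEC = 16 };

struct model_iter {
	unsigned int type;	/* kind | direction */
};

/* Buggy pattern: compares the whole bitmask against one kind, so a
 * write-direction bvec iterator is not recognised. */
static bool model_is_bvec_direct(const struct model_iter *it)
{
	return it->type == MODEL_ITER_BVEC;
}

/* Helper pattern: strip the direction bit before comparing. */
static bool model_is_bvec_masked(const struct model_iter *it)
{
	return (it->type & ~(unsigned int)MODEL_WRITE) == MODEL_ITER_BVEC;
}
```

In this model the direct comparison only happens to work for the read direction, because the read bit is zero.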
      
      Fixes: ff6165b2 ("io_uring: retain iov_iter state over io_read/io_write calls")
      Reported-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: <stable@vger.kernel.org> # 5.9
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9c3a205c
    • io_uring: fix shift-out-of-bounds when round up cq size · eb2667b3
      Joseph Qi authored
      
      Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create():
      
      [ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
      [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int'
      [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3
      [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 59.603673] Call Trace:
      [ 59.604286] dump_stack+0x107/0x163
      [ 59.605237] ubsan_epilogue+0xb/0x5a
      [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e
      [ 59.607335] ? lock_downgrade+0x6c0/0x6c0
      [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0
      [ 59.609166] io_uring_create.cold+0x99/0x149
      [ 59.610114] io_uring_setup+0xd6/0x140
      [ 59.610975] ? io_uring_create+0x2510/0x2510
      [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400
      [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80
      [ 59.614038] ? trace_hardirqs_on+0x5b/0x180
      [ 59.615056] do_syscall_64+0x2d/0x40
      [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 59.617007] RIP: 0033:0x7f2bb8a0b239
      
      This is caused by roundup_pow_of_two() when the input entries value is
      large enough, e.g. 2^32-1. For sq_entries, the value is checked first and
      we allow at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we
      round up first, which may overflow and truncate the value to 0, which is
      not the expected behavior. So check the cq size first and then round up.
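
The ordering problem described above can be sketched in userspace C. This is only a model under stated assumptions: `MODEL_MAX_CQ_ENTRIES`, `model_round_up_pow2()`, and the two `model_cq_*` functions are hypothetical names, the limit value is arbitrary, and the loop-based rounding stands in for the kernel's roundup_pow_of_two() rather than reproducing it.

```c
#include <stdint.h>

#define MODEL_MAX_CQ_ENTRIES 65536u	/* hypothetical limit for this sketch */

/* Round n up to the next power of two using 64-bit arithmetic.
 * Callers in this sketch only pass values that fit in 32 bits. */
static uint64_t model_round_up_pow2(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Unsafe order: round first, then store into a 32-bit field.
 * For 2^32-1 the rounded value is 2^32, which truncates to 0. */
static uint32_t model_cq_round_then_store(uint32_t entries)
{
	return (uint32_t)model_round_up_pow2(entries);
}

/* Safer order: reject zero and oversized requests before rounding.
 * Returns 0 on success and -1 when the request is rejected. */
static int model_cq_check_then_round(uint32_t entries, uint32_t *out)
{
	if (entries == 0 || entries > MODEL_MAX_CQ_ENTRIES)
		return -1;
	*out = (uint32_t)model_round_up_pow2(entries);
	return 0;
}
```

With the check done first, the rounded result can never exceed the limit, so the truncation case is unreachable in this model.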
      
      Fixes: 88ec3211 ("io_uring: round-up cq size before comparing with rounded sq size")
      Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb2667b3
  4. Nov 23, 2020
    • btrfs: fix lockdep splat when enabling and disabling qgroups · a855fbe6
      Filipe Manana authored
      
      When running test case btrfs/017 from fstests, lockdep reported the
      following splat:
      
        [ 1297.067385] ======================================================
        [ 1297.067708] WARNING: possible circular locking dependency detected
        [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
        [ 1297.068322] ------------------------------------------------------
        [ 1297.068629] btrfs/189080 is trying to acquire lock:
        [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.069274]
      		 but task is already holding lock:
        [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
        [ 1297.070219]
      		 which lock already depends on the new lock.
      
        [ 1297.071131]
      		 the existing dependency chain (in reverse order) is:
        [ 1297.071721]
      		 -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
        [ 1297.072375]        lock_acquire+0xd8/0x490
        [ 1297.072710]        __mutex_lock+0xa3/0xb30
        [ 1297.073061]        btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
        [ 1297.073421]        create_subvol+0x194/0x990 [btrfs]
        [ 1297.073780]        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
        [ 1297.074133]        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
        [ 1297.074498]        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
        [ 1297.074872]        btrfs_ioctl+0x1a90/0x36f0 [btrfs]
        [ 1297.075245]        __x64_sys_ioctl+0x83/0xb0
        [ 1297.075617]        do_syscall_64+0x33/0x80
        [ 1297.075993]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.076380]
      		 -> #0 (sb_internal#2){.+.+}-{0:0}:
        [ 1297.077166]        check_prev_add+0x91/0xc60
        [ 1297.077572]        __lock_acquire+0x1740/0x3110
        [ 1297.077984]        lock_acquire+0xd8/0x490
        [ 1297.078411]        start_transaction+0x3c5/0x760 [btrfs]
        [ 1297.078853]        btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.079323]        btrfs_ioctl+0x2c60/0x36f0 [btrfs]
        [ 1297.079789]        __x64_sys_ioctl+0x83/0xb0
        [ 1297.080232]        do_syscall_64+0x33/0x80
        [ 1297.080680]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.081139]
      		 other info that might help us debug this:
      
        [ 1297.082536]  Possible unsafe locking scenario:
      
        [ 1297.083510]        CPU0                    CPU1
        [ 1297.084005]        ----                    ----
        [ 1297.084500]   lock(&fs_info->qgroup_ioctl_lock);
        [ 1297.084994]                                lock(sb_internal#2);
        [ 1297.085485]                                lock(&fs_info->qgroup_ioctl_lock);
        [ 1297.085974]   lock(sb_internal#2);
        [ 1297.086454]
      		  *** DEADLOCK ***
        [ 1297.087880] 3 locks held by btrfs/189080:
        [ 1297.088324]  #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
        [ 1297.088799]  #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
        [ 1297.089284]  #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
        [ 1297.089771]
      		 stack backtrace:
        [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
        [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [ 1297.092123] Call Trace:
        [ 1297.092629]  dump_stack+0x8d/0xb5
        [ 1297.093115]  check_noncircular+0xff/0x110
        [ 1297.093596]  check_prev_add+0x91/0xc60
        [ 1297.094076]  ? kvm_clock_read+0x14/0x30
        [ 1297.094553]  ? kvm_sched_clock_read+0x5/0x10
        [ 1297.095029]  __lock_acquire+0x1740/0x3110
        [ 1297.095510]  lock_acquire+0xd8/0x490
        [ 1297.095993]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.096476]  start_transaction+0x3c5/0x760 [btrfs]
        [ 1297.096962]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.097451]  btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.097941]  ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
        [ 1297.098429]  btrfs_ioctl+0x2c60/0x36f0 [btrfs]
        [ 1297.098904]  ? do_user_addr_fault+0x20c/0x430
        [ 1297.099382]  ? kvm_clock_read+0x14/0x30
        [ 1297.099854]  ? kvm_sched_clock_read+0x5/0x10
        [ 1297.100328]  ? sched_clock+0x5/0x10
        [ 1297.100801]  ? sched_clock_cpu+0x12/0x180
        [ 1297.101272]  ? __x64_sys_ioctl+0x83/0xb0
        [ 1297.101739]  __x64_sys_ioctl+0x83/0xb0
        [ 1297.102207]  do_syscall_64+0x33/0x80
        [ 1297.102673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.103148] RIP: 0033:0x7f773ff65d87
      
      This is because during the quota enable ioctl we first lock the mutex
      qgroup_ioctl_lock and then start a transaction, and starting a transaction
      acquires a fs freeze semaphore (at the VFS level). However, in every other
      code path, except for the quota disable ioctl path, we do the opposite:
      we start a transaction and then lock the mutex.
      
      So fix this by making the quota enable and disable paths start the
      transaction without having the mutex locked, and then, after starting the
      transaction, lock the mutex and check if some other task already enabled
      or disabled the quotas, bailing with success if that was the case.
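
The fixed ordering (transaction first, mutex second, then a re-check of the state) can be sketched with pthreads. This is a loose userspace model, not btrfs code: every `model_*` name is invented here, and a plain mutex stands in for starting a transaction, which is a simplification since a real transaction is not an exclusive lock.

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for "start a transaction" (simplified to a mutex). */
static pthread_mutex_t model_trans_lock = PTHREAD_MUTEX_INITIALIZER;
/* Stand-in for the qgroup ioctl mutex. */
static pthread_mutex_t model_ioctl_lock = PTHREAD_MUTEX_INITIALIZER;

static bool model_quota_enabled;
static int model_enable_work_done;	/* counts how often the real work ran */

/* Always takes the locks in the same order as every other path:
 * transaction first, ioctl mutex second. */
static int model_quota_enable(void)
{
	pthread_mutex_lock(&model_trans_lock);
	pthread_mutex_lock(&model_ioctl_lock);

	/* Another task may have enabled quotas while we were not holding
	 * the mutex, so re-check and bail out with success if so. */
	if (!model_quota_enabled) {
		model_quota_enabled = true;
		model_enable_work_done++;
	}

	pthread_mutex_unlock(&model_ioctl_lock);
	pthread_mutex_unlock(&model_trans_lock);
	return 0;
}
```

Because the state is only inspected after both locks are held, a second caller sees the flag already set and returns success without repeating the work.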
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a855fbe6
    • btrfs: do nofs allocations when adding and removing qgroup relations · 7aa6d359
      Filipe Manana authored
      
      When adding or removing a qgroup relation we are doing a GFP_KERNEL
      allocation which is not safe because we are holding a transaction
      handle open and that can make us deadlock if the allocator needs to
      recurse into the filesystem. So just surround those calls with a
      nofs context.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7aa6d359
    • btrfs: fix lockdep splat when reading qgroup config on mount · 3d05cad3
      Filipe Manana authored
      
      Lockdep reported the following splat when running test btrfs/190 from
      fstests:
      
        [ 9482.126098] ======================================================
        [ 9482.126184] WARNING: possible circular locking dependency detected
        [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
        [ 9482.126365] ------------------------------------------------------
        [ 9482.126456] mount/24187 is trying to acquire lock:
        [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.126647]
      		 but task is already holding lock:
        [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.126886]
      		 which lock already depends on the new lock.
      
        [ 9482.127078]
      		 the existing dependency chain (in reverse order) is:
        [ 9482.127213]
      		 -> #1 (btrfs-quota-00){++++}-{3:3}:
        [ 9482.127366]        lock_acquire+0xd8/0x490
        [ 9482.127436]        down_read_nested+0x45/0x220
        [ 9482.127528]        __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.127613]        btrfs_read_lock_root_node+0x41/0x130 [btrfs]
        [ 9482.127702]        btrfs_search_slot+0x514/0xc30 [btrfs]
        [ 9482.127788]        update_qgroup_status_item+0x72/0x140 [btrfs]
        [ 9482.127877]        btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
        [ 9482.127964]        btrfs_work_helper+0xf1/0x600 [btrfs]
        [ 9482.128039]        process_one_work+0x24e/0x5e0
        [ 9482.128110]        worker_thread+0x50/0x3b0
        [ 9482.128181]        kthread+0x153/0x170
        [ 9482.128256]        ret_from_fork+0x22/0x30
        [ 9482.128327]
      		 -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
        [ 9482.128464]        check_prev_add+0x91/0xc60
        [ 9482.128551]        __lock_acquire+0x1740/0x3110
        [ 9482.128623]        lock_acquire+0xd8/0x490
        [ 9482.130029]        __mutex_lock+0xa3/0xb30
        [ 9482.130590]        qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.131577]        btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
        [ 9482.132175]        open_ctree+0x1228/0x18a0 [btrfs]
        [ 9482.132756]        btrfs_mount_root.cold+0x13/0xed [btrfs]
        [ 9482.133325]        legacy_get_tree+0x30/0x60
        [ 9482.133866]        vfs_get_tree+0x28/0xe0
        [ 9482.134392]        fc_mount+0xe/0x40
        [ 9482.134908]        vfs_kern_mount.part.0+0x71/0x90
        [ 9482.135428]        btrfs_mount+0x13b/0x3e0 [btrfs]
        [ 9482.135942]        legacy_get_tree+0x30/0x60
        [ 9482.136444]        vfs_get_tree+0x28/0xe0
        [ 9482.136949]        path_mount+0x2d7/0xa70
        [ 9482.137438]        do_mount+0x75/0x90
        [ 9482.137923]        __x64_sys_mount+0x8e/0xd0
        [ 9482.138400]        do_syscall_64+0x33/0x80
        [ 9482.138873]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 9482.139346]
      		 other info that might help us debug this:
      
        [ 9482.140735]  Possible unsafe locking scenario:
      
        [ 9482.141594]        CPU0                    CPU1
        [ 9482.142011]        ----                    ----
        [ 9482.142411]   lock(btrfs-quota-00);
        [ 9482.142806]                                lock(&fs_info->qgroup_rescan_lock);
        [ 9482.143216]                                lock(btrfs-quota-00);
        [ 9482.143629]   lock(&fs_info->qgroup_rescan_lock);
        [ 9482.144056]
      		  *** DEADLOCK ***
      
        [ 9482.145242] 2 locks held by mount/24187:
        [ 9482.145637]  #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
        [ 9482.146061]  #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.146509]
      		 stack backtrace:
        [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
        [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [ 9482.148709] Call Trace:
        [ 9482.149169]  dump_stack+0x8d/0xb5
        [ 9482.149628]  check_noncircular+0xff/0x110
        [ 9482.150090]  check_prev_add+0x91/0xc60
        [ 9482.150561]  ? kvm_clock_read+0x14/0x30
        [ 9482.151017]  ? kvm_sched_clock_read+0x5/0x10
        [ 9482.151470]  __lock_acquire+0x1740/0x3110
        [ 9482.151941]  ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.152402]  lock_acquire+0xd8/0x490
        [ 9482.152887]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.153354]  __mutex_lock+0xa3/0xb30
        [ 9482.153826]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.154301]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.154768]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.155226]  qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.155690]  btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
        [ 9482.156160]  open_ctree+0x1228/0x18a0 [btrfs]
        [ 9482.156643]  btrfs_mount_root.cold+0x13/0xed [btrfs]
        [ 9482.157108]  ? rcu_read_lock_sched_held+0x5d/0x90
        [ 9482.157567]  ? kfree+0x31f/0x3e0
        [ 9482.158030]  legacy_get_tree+0x30/0x60
        [ 9482.158489]  vfs_get_tree+0x28/0xe0
        [ 9482.158947]  fc_mount+0xe/0x40
        [ 9482.159403]  vfs_kern_mount.part.0+0x71/0x90
        [ 9482.159875]  btrfs_mount+0x13b/0x3e0 [btrfs]
        [ 9482.160335]  ? rcu_read_lock_sched_held+0x5d/0x90
        [ 9482.160805]  ? kfree+0x31f/0x3e0
        [ 9482.161260]  ? legacy_get_tree+0x30/0x60
        [ 9482.161714]  legacy_get_tree+0x30/0x60
        [ 9482.162166]  vfs_get_tree+0x28/0xe0
        [ 9482.162616]  path_mount+0x2d7/0xa70
        [ 9482.163070]  do_mount+0x75/0x90
        [ 9482.163525]  __x64_sys_mount+0x8e/0xd0
        [ 9482.163986]  do_syscall_64+0x33/0x80
        [ 9482.164437]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 9482.164902] RIP: 0033:0x7f51e907caaa
      
      This happens because at btrfs_read_qgroup_config() we can call
      qgroup_rescan_init() while holding a read lock on a quota btree leaf,
      acquired by the previous call to btrfs_search_slot_for_read(), and
      qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.
      
      A qgroup rescan worker does the opposite: it acquires the mutex
      qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
      update the qgroup status item in the quota btree through the call to
      update_qgroup_status_item(). This inversion of locking order
      between the qgroup_rescan_lock mutex and quota btree locks causes the
      splat.
      
      Fix this simply by releasing and freeing the path before calling
      qgroup_rescan_init() at btrfs_read_qgroup_config().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3d05cad3
    • btrfs: tree-checker: add missing returns after data_ref alignment checks · 6d06b0ad
      David Sterba authored
      
      Some sectorsize alignment failures are reported, but
      check_extent_data_ref() then continues. This was not intended: wrong
      alignment is not a minor problem and we should return with an error.
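
The "report and return" pattern described above looks roughly like this in userspace C. It is only a sketch: `model_check_data_ref()` is a hypothetical function, the message text is made up, and `-EINVAL` is a placeholder since the source does not state which error code the checker returns.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Checks that offset is aligned to sectorsize (assumed nonzero).
 * Returns 0 when aligned, a negative errno value otherwise. */
static int model_check_data_ref(uint64_t offset, uint32_t sectorsize)
{
	if (offset % sectorsize != 0) {
		fprintf(stderr, "data ref offset %llu not aligned to %u\n",
			(unsigned long long)offset, (unsigned int)sectorsize);
		/* The missing piece: stop here instead of falling through
		 * to further checks on a misaligned value. */
		return -EINVAL;
	}
	return 0;
}
```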
      
      CC: stable@vger.kernel.org # 5.4+
      Fixes: 0785a9aa ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d06b0ad
    • btrfs: don't access possibly stale fs_info data for printing duplicate device · 0697d9a6
      Johannes Thumshirn authored
      
      Syzbot reported a possible use-after-free when printing a duplicate device
      warning in device_list_add().
      
      At this point it can happen that a btrfs_device::fs_info is not correctly
      set up yet, so we're accessing stale data when printing the warning
      message using the btrfs_printk() wrappers.
      
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
        Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
      
        CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x1d6/0x29e lib/dump_stack.c:118
         print_address_description+0x66/0x620 mm/kasan/report.c:383
         __kasan_report mm/kasan/report.c:513 [inline]
         kasan_report+0x132/0x1d0 mm/kasan/report.c:530
         btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
         device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
         btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
         btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x44840a
        RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
        RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
        RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
        R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
      
        Allocated by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track mm/kasan/common.c:56 [inline]
         __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
         kmalloc_node include/linux/slab.h:577 [inline]
         kvmalloc_node+0x81/0x110 mm/util.c:574
         kvmalloc include/linux/mm.h:757 [inline]
         kvzalloc include/linux/mm.h:765 [inline]
         btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
         kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
         __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
         __cache_free mm/slab.c:3418 [inline]
         kfree+0x113/0x200 mm/slab.c:3756
         deactivate_locked_super+0xa7/0xf0 fs/super.c:335
         btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff8880878e0000
         which belongs to the cache kmalloc-16k of size 16384
        The buggy address is located 1704 bytes inside of
         16384-byte region [ffff8880878e0000, ffff8880878e4000)
        The buggy address belongs to the page:
        page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
        head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
        flags: 0xfffe0000010200(slab|head)
        raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
        raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      				    ^
         ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      The syzkaller reproducer for this use-after-free crafts a filesystem image
      and loop mounts it twice in a loop. The mount will fail as the crafted
      image has an invalid chunk tree. When this happens btrfs_mount_root() will
      call deactivate_locked_super(), which then cleans up fs_info and
      fs_info::sb. If a second thread now adds the same block-device to the
      filesystem, it will get detected as a duplicate device and
      device_list_add() will reject the duplicate and print a warning. But as
      the fs_info pointer passed in is non-NULL this will result in a
      use-after-free.
      
      Instead of printing possibly uninitialized or already freed memory in
      btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
      device name will be skipped altogether.
      
      There was a slightly different approach discussed in
      https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
      
      
      Reported-by: <syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com>
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0697d9a6
  5. Nov 22, 2020
    • afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      David Howells authored
      
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
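
The rule of ignoring a speculative update whose expected data version no longer matches can be sketched as follows. This is an illustrative userspace model, not the afs implementation: the structures, the `dv_expected` snapshot field, and `model_apply_status()` are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_vnode {
	uint64_t data_version;	/* version currently recorded locally */
};

struct model_status {
	uint64_t data_version;	/* version reported by the server */
	uint64_t dv_expected;	/* local version snapshotted when the fetch was issued */
	bool speculative;	/* true for the bulk "lookup ahead" fetch */
};

/* Commits a status to the vnode. Returns true if it was applied and
 * false if a stale speculative status was ignored. */
static bool model_apply_status(struct model_vnode *vnode,
			       const struct model_status *st)
{
	/* The vnode changed since the speculative fetch started, so the
	 * fetched status may predate a local modification: drop it. */
	if (st->speculative && vnode->data_version != st->dv_expected)
		return false;

	vnode->data_version = st->data_version;
	return true;
}
```

Replaying the 8->9 case from the dmesg line: the modification commits version 9 first, and the speculative status that still expects version 8 is then dropped rather than invalidating the cache.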
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
    • libfs: fix error cast of negative value in simple_attr_write() · 488dac0c
      Yicong Yang authored
      
      The attr->set() callback receives a u64 value, but simple_strtoll() is
      used for the conversion.  This leads to an erroneous cast if the user
      inputs a negative value.
      
      Use kstrtoull() instead of simple_strtoll() to convert a string from
      the user to an unsigned value.  The former will return '-EINVAL' if it
      gets a negative value, but the latter can't handle the situation
      correctly.  Make 'val' unsigned long long, which is what kstrtoull()
      takes; this eliminates the compile warning on non-64-bit architectures.
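
The effect of the signed conversion can be reproduced in userspace C. The sketch below uses the standard library's strtoll() and strtoull() as stand-ins for the kernel helpers. Both `model_*` functions are hypothetical, and the strict parser only approximates the behaviour described above (rejecting negative input with -EINVAL); it is not kstrtoull() itself.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Old pattern: parse as signed, then hand the result to an unsigned
 * consumer. "-1" silently becomes the largest unsigned value. */
static unsigned long long model_parse_signed_cast(const char *s)
{
	return (unsigned long long)strtoll(s, NULL, 0);
}

/* Strict pattern: accept only an unsigned number, optionally followed
 * by a single newline. Returns 0 on success or a negative errno. */
static int model_parse_u64(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	/* Requiring a leading digit rejects "-1", whitespace and "". */
	if (!isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*out = val;
	return 0;
}
```

The leading-digit check is needed because strtoull() on its own accepts a minus sign and wraps the result, which would reintroduce the original problem.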
      
      Fixes: f7b88631 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      488dac0c
  6. Nov 20, 2020
  7. Nov 19, 2020
    • xfs: revert "xfs: fix rmap key and record comparison functions" · eb840907
      Darrick J. Wong authored
      
      This reverts commit 6ff646b2.
      
      Your maintainer committed a major braino in the rmap code by adding the
      attr fork, bmbt, and unwritten extent usage bits into rmap record key
      comparisons.  While XFS uses the usage bits *in the rmap records* for
      cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
      the owner and offset information to distinguish between reverse mappings
      of the same physical extent into the data fork of a file at multiple
      offsets.  The other bits are not important for key comparisons for index
      lookups, and never have been.
      
      Eric Sandeen reports that this causes regressions in generic/299, so
      undo this patch before it does more damage.
      
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Fixes: 6ff646b2 ("xfs: fix rmap key and record comparison functions")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      eb840907
    • Theodore Ts'o's avatar
      ext4: drop fast_commit from /proc/mounts · 704c2317
      Theodore Ts'o authored
      
      The options in /proc/mounts must be valid mount options --- and
      fast_commit is not a mount option.  Otherwise, command sequences like
      this will fail:
      
          # mount /dev/vdc /vdc
          # mkdir -p /vdc/phoronix_test_suite /pts
          # mount --bind /vdc/phoronix_test_suite /pts
          # mount -o remount,nodioread_nolock /pts
          mount: /pts: mount point not mounted or bad option.
      
      And in the system logs, you'll find:
      
          EXT4-fs (vdc): Unrecognized mount option "fast_commit" or missing value
      
      Fixes: 995a3ed6 ("ext4: add fast_commit feature and handling for extended mount options")
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      704c2317
    • Dave Chinner's avatar
      xfs: don't allow NOWAIT DIO across extent boundaries · 883a790a
      Dave Chinner authored
      
      Jens has reported a situation where partial direct IOs can be issued
      and completed yet still return -EAGAIN. We don't want this to report
      a short IO as we want XFS to complete user DIO entirely or not at
      all.
      
      This partial IO situation can occur on a write IO that is split
      across an allocated extent and a hole, and the second mapping is
      returning EAGAIN because allocation would be required.
      
      The trivial reproducer:
      
      $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
      wrote 4096/4096 bytes at offset 0
      4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
      pwrite: Resource temporarily unavailable
      $
      
      The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
      the first 4kB write:
      
       xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
       iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
       xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
       iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
      
      Here the first iomap loop has mapped the first 4kB of the file and
      issued the IO, and we enter the second iomap_apply loop:
      
       iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
      
      And we exit with -EAGAIN because we hit the allocate case trying
      to map the second 4kB block.
      
      Then IO completes on the first 4kB and the original IO context
      completes and unlocks the inode, returning -EAGAIN to userspace:
      
       xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
       xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
      
      There are other vectors to the same problem when we re-enter the
      mapping code if we have to make multiple mappings under NOWAIT
      conditions, e.g. failing trylocks, COW extents being found,
      allocation being required, and so on.
      
      Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
      to go ahead if the mapping we retrieve for the IO spans an entire
      allocated extent. This avoids the possibility of subsequent mappings
      to complete the IO from triggering NOWAIT semantics by any means as
      NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
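
The rule above can be sketched as a small check. This is a simplified model and not the XFS code: `struct demo_extent` and `demo_nowait_check` are hypothetical names, and offsets are in filesystem blocks.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified extent mapping in units of filesystem blocks. */
struct demo_extent {
    uint64_t startoff;      /* first file block covered by the mapping */
    uint64_t blockcount;    /* number of blocks covered */
    bool allocated;         /* false for a hole */
};

/*
 * Decide whether a NOWAIT direct write of [offset_fsb, end_fsb) may go
 * ahead.  Returns 0 only if the single mapping is allocated and covers
 * the whole range; otherwise -EAGAIN, so that no part of the IO is issued.
 */
static int demo_nowait_check(const struct demo_extent *map,
                             uint64_t offset_fsb, uint64_t end_fsb)
{
    if (!map->allocated)
        return -EAGAIN;
    if (map->startoff > offset_fsb)
        return -EAGAIN;
    if (map->startoff + map->blockcount < end_fsb)
        return -EAGAIN;
    return 0;
}
```

For the reproducer, the first mapping covers one block while the write spans two, so this model returns -EAGAIN before any IO is submitted.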
      
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      883a790a
  8. Nov 18, 2020
  9. Nov 17, 2020
  10. Nov 16, 2020
    • Rohith Surabattula's avatar
      smb3: Handle error case during offload read path · 12541000
      Rohith Surabattula authored
      
      Mid callback needs to be called only when valid data is
      read into pages.
      
      These patches address a problem found during decryption offload:
            CIFS: VFS: trying to dequeue a deleted mid
      that could cause a refcount use after free:
            Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
      
      Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      CC: Stable <stable@vger.kernel.org> #5.4+
      Signed-off-by: Steve French <stfrench@microsoft.com>
      12541000
    • Rohith Surabattula's avatar
      smb3: Avoid Mid pending list corruption · ac873aa3
      Rohith Surabattula authored
      
      When a reconnect happens, the Mid queue can be corrupted if both the
      demultiplex and offload threads try to dequeue the MID from the
      pending list.
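
One common way to make a double dequeue harmless is to record that an entry has already been removed. The sketch below illustrates that idea only; `struct demo_mid`, `DEMO_MID_DELETED`, and the helpers are hypothetical and are not taken from the cifs code. In real code both threads would also have to hold the same lock around the check and the unlink.

```c
#include <stdbool.h>

#define DEMO_MID_DELETED 0x1u   /* hypothetical flag: already dequeued */

struct demo_mid {
    struct demo_mid *prev, *next;   /* links in a circular pending list */
    unsigned int flags;
};

static void demo_list_init(struct demo_mid *head)
{
    head->prev = head->next = head;
    head->flags = 0;
}

/* Append an entry at the tail of the pending list. */
static void demo_enqueue(struct demo_mid *head, struct demo_mid *mid)
{
    mid->prev = head->prev;
    mid->next = head;
    head->prev->next = mid;
    head->prev = mid;
    mid->flags = 0;
}

/* Dequeue at most once; returns true only for the caller that unlinked it. */
static bool demo_dequeue_once(struct demo_mid *mid)
{
    if (mid->flags & DEMO_MID_DELETED)
        return false;
    mid->prev->next = mid->next;
    mid->next->prev = mid->prev;
    mid->prev = mid->next = mid;
    mid->flags |= DEMO_MID_DELETED;
    return true;
}
```

The second caller sees the flag and leaves the neighbouring entries untouched, so the list stays consistent.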
      
      These patches address a problem found during decryption offload:
               CIFS: VFS: trying to dequeue a deleted mid
      that could cause a refcount use after free:
               Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
      
      Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      CC: Stable <stable@vger.kernel.org> #5.4+
      Signed-off-by: Steve French <stfrench@microsoft.com>
      ac873aa3
    • Rohith Surabattula's avatar
      smb3: Call cifs reconnect from demultiplex thread · de9ac0a6
      Rohith Surabattula authored
      
      cifs_reconnect needs to be called only from the demultiplex thread.
      Skip cifs_reconnect in the offload thread, so that cifs_reconnect will
      be called by the demultiplex thread on a subsequent request.
      
      These patches address a problem found during decryption offload:
           CIFS: VFS: trying to dequeue a deleted mid
      that can cause a refcount use after free:
      
      [ 1271.389453] Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
      [ 1271.389456] RIP: 0010:refcount_warn_saturate+0xae/0xf0
      [ 1271.389457] Code: fa 1d 6a 01 01 e8 c7 44 b1 ff 0f 0b 5d c3 80 3d e7 1d 6a 01 00 75 91 48 c7 c7 d8 be 1d a2 c6 05 d7 1d 6a 01 01 e8 a7 44 b1 ff <0f> 0b 5d c3 80 3d c5 1d 6a 01 00 0f 85 6d ff ff ff 48 c7 c7 30 bf
      [ 1271.389458] RSP: 0018:ffffa4cdc1f87e30 EFLAGS: 00010286
      [ 1271.389458] RAX: 0000000000000000 RBX: ffff9974d2809f00 RCX: ffff9974df898cc8
      [ 1271.389459] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9974df898cc0
      [ 1271.389460] RBP: ffffa4cdc1f87e30 R08: 0000000000000004 R09: 00000000000002c0
      [ 1271.389460] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9974b7fdb5c0
      [ 1271.389461] R13: ffff9974d2809f00 R14: ffff9974ccea0a80 R15: ffff99748e60db80
      [ 1271.389462] FS:  0000000000000000(0000) GS:ffff9974df880000(0000) knlGS:0000000000000000
      [ 1271.389462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1271.389463] CR2: 000055c60f344fe4 CR3: 0000001031a3c002 CR4: 00000000003706e0
      [ 1271.389465] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1271.389465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1271.389466] Call Trace:
      [ 1271.389483]  cifs_mid_q_entry_release+0xce/0x110 [cifs]
      [ 1271.389499]  smb2_decrypt_offload+0xa9/0x1c0 [cifs]
      [ 1271.389501]  process_one_work+0x1e8/0x3b0
      [ 1271.389503]  worker_thread+0x50/0x370
      [ 1271.389504]  kthread+0x12f/0x150
      [ 1271.389506]  ? process_one_work+0x3b0/0x3b0
      [ 1271.389507]  ? __kthread_bind_mask+0x70/0x70
      [ 1271.389509]  ret_from_fork+0x22/0x30
      
      Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      CC: Stable <stable@vger.kernel.org> #5.4+
      Signed-off-by: Steve French <stfrench@microsoft.com>
      de9ac0a6
    • Namjae Jeon's avatar
      cifs: fix a memleak with modefromsid · 98128572
      Namjae Jeon authored
      
      kmemleak reported a memory leak allocated in query_info() when cifs is
      working with modefromsid.
      
        backtrace:
          [<00000000aeef6a1e>] slab_post_alloc_hook+0x58/0x510
          [<00000000b2f7a440>] __kmalloc+0x1a0/0x390
          [<000000006d470ebc>] query_info+0x5b5/0x700 [cifs]
          [<00000000bad76ce0>] SMB2_query_acl+0x2b/0x30 [cifs]
          [<000000001fa09606>] get_smb2_acl_by_path+0x2f3/0x720 [cifs]
          [<000000001b6ebab7>] get_smb2_acl+0x75/0x90 [cifs]
          [<00000000abf43904>] cifs_acl_to_fattr+0x13b/0x1d0 [cifs]
          [<00000000a5372ec3>] cifs_get_inode_info+0x4cd/0x9a0 [cifs]
          [<00000000388e0a04>] cifs_revalidate_dentry_attr+0x1cd/0x510 [cifs]
          [<0000000046b6b352>] cifs_getattr+0x8a/0x260 [cifs]
          [<000000007692c95e>] vfs_getattr_nosec+0xa1/0xc0
          [<00000000cbc7d742>] vfs_getattr+0x36/0x40
          [<00000000de8acf67>] vfs_statx_fd+0x4a/0x80
          [<00000000a58c6adb>] __do_sys_newfstat+0x31/0x70
          [<00000000300b3b4e>] __x64_sys_newfstat+0x16/0x20
          [<000000006d8e9c48>] do_syscall_64+0x37/0x80
      
      This patch adds the missing kfree() for pntsd when mounting with the
      modefromsid option.
      
      Cc: Stable <stable@vger.kernel.org> # v5.4+
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      98128572
  11. Nov 14, 2020
    • David Howells's avatar
      afs: Fix afs_write_end() when called with copied == 0 [ver #3] · 3ad216ee
      David Howells authored
      
      When afs_write_end() is called with copied == 0, it tries to set the
      dirty region, but there's no way to actually encode a 0-length region in
      the encoding in page->private.
      
      "0,0", for example, indicates a 1-byte region at offset 0.  The maths
      miscalculates this and sets it incorrectly.
      
      Fix it to just do nothing but unlock and put the page in this case.  We
      don't actually need to mark the page dirty as nothing presumably
      changed.
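
The encoding problem can be illustrated with a small model. The packing below is an assumption made for the example (16-bit fields, with the end stored as `to - 1`); it is not the actual page->private layout. It shows why "0,0" reads back as a 1-byte region and why a zero-length region has no valid encoding, so the safe choice is to record nothing when copied == 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SHIFT 16
#define DEMO_MASK  0xffffu

/* Hypothetical packing: low half holds `from`, high half holds `to - 1`. */
static uint32_t demo_encode(uint32_t from, uint32_t to)
{
    return (((to - 1) & DEMO_MASK) << DEMO_SHIFT) | (from & DEMO_MASK);
}

static uint32_t demo_from(uint32_t priv)
{
    return priv & DEMO_MASK;
}

static uint32_t demo_to(uint32_t priv)
{
    return ((priv >> DEMO_SHIFT) & DEMO_MASK) + 1;
}

/*
 * Record the dirty region only if something was actually copied.
 * Returns true if *priv was updated.
 */
static bool demo_write_end(uint32_t *priv, uint32_t pos, uint32_t copied)
{
    if (copied == 0)
        return false;   /* nothing changed: leave *priv alone */
    *priv = demo_encode(pos, pos + copied);
    return true;
}
```

In this model, encoding the empty range (0, 0) wraps `to - 1` around and decodes as a region ending at 65536, which is the kind of miscalculation the early exit avoids.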
      
      Fixes: 65dd2d60 ("afs: Alter dirty range encoding in page->private")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ad216ee
    • Wengang Wang's avatar
      ocfs2: initialize ip_next_orphan · f5785283
      Wengang Wang authored
      
      Though the problem was found on an older 4.1.12 kernel, I think upstream
      has the same issue.
      
      On one node in the cluster, there is the following call trace:
      
         # cat /proc/21473/stack
         __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
         ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
         ocfs2_evict_inode+0x152/0x820 [ocfs2]
         evict+0xae/0x1a0
         iput+0x1c6/0x230
         ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
         ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
         ocfs2_dir_foreach+0x29/0x30 [ocfs2]
         ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
         ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
         process_one_work+0x169/0x4a0
         worker_thread+0x5b/0x560
         kthread+0xcb/0xf0
         ret_from_fork+0x61/0x90
      
      The above stack is not reasonable: the final iput shouldn't happen in
      the ocfs2_orphan_filldir() function.  Looking at the code,
      
        2067         /* Skip inodes which are already added to recover list, since dio may
        2068          * happen concurrently with unlink/rename */
        2069         if (OCFS2_I(iter)->ip_next_orphan) {
        2070                 iput(iter);
        2071                 return 0;
        2072         }
        2073
      
      The logic assumes the inode is already in the recover list on seeing
      that ip_next_orphan is non-NULL, so it skips this inode after dropping
      the reference that was taken in ocfs2_iget().
      
      However, if the inode were already in the recover list, it would hold
      another reference, and the iput() at line 2070 would not be the final
      iput (dropping the last reference).  So I don't think the inode is
      really in the recover list (no vmcore to confirm).
      
      Note that ocfs2_queue_orphans(), though not shown in the call trace,
      holds the cluster lock on the orphan directory while looking up
      unlinked inodes.  The on-disk inode eviction could involve a lot of IO,
      which may take a long time to finish.  That means this node could hold
      the cluster lock for a very long time, which can cause lock requests
      (from other nodes) on the orphan directory to hang for a long time.
      
      Looking more closely at ip_next_orphan, I found that it's not
      initialized when allocating a new ocfs2_inode_info structure.
      
      This causes reflink operations from some nodes to hang for a very long
      time waiting for the cluster lock on the orphan directory.
      
      Fix: initialize ip_next_orphan as NULL.
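
The effect of the missing initialization can be modelled in userspace. Everything here is hypothetical (`struct demo_inode_info`, `demo_alloc_inode`, and the 0xa5 fill that stands in for stale memory from a reused allocation); it only shows how an uninitialized pointer field can be mistaken for list membership.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified inode info. */
struct demo_inode_info {
    struct demo_inode_info *next_orphan;    /* non-NULL: "on recover list" */
    unsigned long blkno;
};

/*
 * Allocate an inode info.  The memset emulates stale contents left over
 * from a previous user of the memory.  next_orphan is only reset when
 * init_orphan is non-zero, which models the fix.
 */
static struct demo_inode_info *demo_alloc_inode(int init_orphan)
{
    struct demo_inode_info *oi = malloc(sizeof(*oi));

    if (!oi)
        return NULL;
    memset(oi, 0xa5, sizeof(*oi));
    oi->blkno = 0;
    if (init_orphan)
        oi->next_orphan = NULL;
    return oi;
}

/* The membership test used by the skip logic in this model. */
static int demo_on_recover_list(const struct demo_inode_info *oi)
{
    return oi->next_orphan != NULL;
}
```

Without the explicit reset, a freshly allocated object looks as if it were already queued, so the skip path is taken and the reference is dropped there.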
      
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5785283
    • Jens Axboe's avatar
      io_uring: handle -EOPNOTSUPP on path resolution · 944d1444
      Jens Axboe authored
      
      Any attempt to do path resolution on /proc/self from an async worker will
      yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
      and without blocking, so retry it from there.
      
      Ideally io_uring would know this upfront and not have to go through the
      worker thread to find out, but that doesn't currently seem feasible.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      944d1444
  12. Nov 13, 2020
    • Jens Axboe's avatar
      proc: don't allow async path resolution of /proc/self components · 8d4c3e76
      Jens Axboe authored
      
      If this is attempted by a kthread, then return -EOPNOTSUPP as we don't
      currently support that. Once we can get task_pid_ptr() doing the right
      thing, then this can go away again.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8d4c3e76
    • Daniel Xu's avatar
      btrfs: tree-checker: add missing return after error in root_item · 1a49a97d
      Daniel Xu authored
      There's a missing return statement after an error is found in the
      root_item. This can cause further problems when a crafted image
      triggers the error.
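
The class of bug can be sketched as follows. This is a simplified model, not the btrfs tree-checker: `struct demo_root_item`, `DEMO_MAX_LEVEL`, and `demo_check_root_item` are hypothetical. The point is that a failed size check must return immediately, otherwise the code below it goes on to read an item of the wrong size.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef EUCLEAN
#define EUCLEAN 117     /* Linux value, defined here only for portability */
#endif

#define DEMO_MAX_LEVEL 8

/* Hypothetical, simplified root item. */
struct demo_root_item {
    uint64_t generation;
    uint8_t level;
};

static int demo_check_root_item(const void *data, size_t size)
{
    struct demo_root_item ri;

    if (size != sizeof(ri)) {
        /*
         * Report the error and stop.  Without this return, the
         * memcpy() below would read past a too-small buffer.
         */
        return -EUCLEAN;
    }

    memcpy(&ri, data, sizeof(ri));
    if (ri.level >= DEMO_MAX_LEVEL)
        return -EUCLEAN;
    return 0;
}
```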
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
      
      
      Fixes: 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a49a97d
    • Qu Wenruo's avatar
      btrfs: qgroup: don't commit transaction when we already hold the handle · 6f23277a
      Qu Wenruo authored
      [BUG]
      When running the following script, btrfs will trigger an ASSERT():
      
        #!/bin/bash
        mkfs.btrfs -f $dev
        mount $dev $mnt
        xfs_io -f -c "pwrite 0 1G" $mnt/file
        sync
        btrfs quota enable $mnt
        btrfs quota rescan -w $mnt
      
        # Manually set the limit below current usage
        btrfs qgroup limit 512M $mnt $mnt
      
        # Crash happens
        touch $mnt/file
      
      The dmesg looks like this:
      
        assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3230!
        invalid opcode: 0000 [#1] SMP PTI
        RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
         btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
         try_flush_qgroup+0x67/0x100 [btrfs]
         __btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
         btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
         btrfs_update_inode+0x9d/0x110 [btrfs]
         btrfs_dirty_inode+0x5d/0xd0 [btrfs]
         touch_atime+0xb5/0x100
         iterate_dir+0xf1/0x1b0
         __x64_sys_getdents64+0x78/0x110
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fb5afe588db
      
      [CAUSE]
      In try_flush_qgroup(), we assume we don't hold a transaction handle at
      all.  This is true for data reservation and mostly true for metadata,
      since data space reservation always happens before we start a
      transaction, and for most metadata operations we reserve space in
      start_transaction().
      
      But there is an exception, btrfs_delayed_inode_reserve_metadata().
      It holds a transaction handle, while still trying to reserve extra
      metadata space.
      
      When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), the
      qgroup code will join the current transaction and commit it, while we
      still hold a transaction handle.
      
      [FIX]
      Let's check current->journal_info before we join the transaction.
      
      If current->journal_info is unset or BTRFS_SEND_TRANS_STUB, it means
      we are not holding a transaction, and thus are able to join and then
      commit the transaction.
      
      If current->journal_info is a valid transaction handle, we avoid
      committing the transaction and just end it.
      
      This is less effective than committing current transaction, as it won't
      free metadata reserved space, but we may still free some data space
      before new data writes.
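
The decision described in [FIX] can be sketched as a pure function. The names are hypothetical: `DEMO_TRANS_STUB` stands in for a sentinel such as BTRFS_SEND_TRANS_STUB, and the argument models the task's journal pointer. The real flush path does more than pick between two actions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a "not a real handle" sentinel value. */
#define DEMO_TRANS_STUB ((const void *)1)

enum demo_flush_action {
    DEMO_COMMIT,    /* no handle held: safe to join and commit */
    DEMO_END_ONLY   /* handle already held: join, then only end */
};

/* Pick the action from the task's journal pointer. */
static enum demo_flush_action demo_pick_action(const void *journal_info)
{
    bool holds_handle = journal_info != NULL &&
                        journal_info != DEMO_TRANS_STUB;

    return holds_handle ? DEMO_END_ONLY : DEMO_COMMIT;
}
```

Committing while another handle on the same transaction is still open is what tripped the use_count assertion, so the held-handle case only ends the joined transaction.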
      
      Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
      
      
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6f23277a
    • Filipe Manana's avatar
      btrfs: fix missing delalloc new bit for new delalloc ranges · c3347309
      Filipe Manana authored
      When doing a buffered write, through one of the write family syscalls, we
      look for ranges which currently don't have allocated extents and set the
      'delalloc new' bit on them, so that we can report a correct number of used
      blocks to the stat(2) syscall until delalloc is flushed and ordered extents
      complete.
      
      However, there are a few other places where we can do a buffered write
      against a range that is mapped to a hole (no extent allocated) and where
      we do not set the 'delalloc new' bit. Those places are:
      
      - Doing a memory mapped write against a hole;
      
      - Cloning an inline extent into a hole starting at file offset 0;
      
      - Calling btrfs_cont_expand() when the i_size of the file is not aligned
        to the sector size and is located in a hole. For example when cloning
        to a destination offset beyond EOF.
      
      So after such cases, until the corresponding delalloc range is flushed and
      the respective ordered extents complete, we can report an incorrect number
      of blocks used through the stat(2) syscall.
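
A rough model of why the bit matters for stat(2), under the assumption that the reported block count is derived from the bytes already accounted to the inode plus the bytes tracked as new delalloc, each rounded up to the block size. The function name and the exact formula are illustrative and are not quoted from the btrfs source.

```c
#include <stdint.h>

/* Round x up to a multiple of a (a must be non-zero). */
#define DEMO_ALIGN_UP(x, a) ((((x) + (a) - 1) / (a)) * (a))

/* Number of 512-byte blocks reported by stat(2) in this simplified model. */
static uint64_t demo_stat_blocks(uint64_t inode_bytes,
                                 uint64_t new_delalloc_bytes,
                                 uint64_t blocksize)
{
    return (DEMO_ALIGN_UP(inode_bytes, blocksize) +
            DEMO_ALIGN_UP(new_delalloc_bytes, blocksize)) >> 9;
}
```

In this model, a 64K write into a hole that is never counted as new delalloc contributes nothing until it is flushed, so the file reports 0 blocks; once it is counted, the same write reports 128 blocks.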
      
      In some cases we can end up reporting 0 used blocks to stat(2), which is a
      particularly bad value to report as it may mislead tools into thinking a
      file is completely sparse when its i_size is not zero, making them skip
      reading any data. This is an undesired consequence for tools such as
      archivers and other backup tools, as reported a long time ago in the
      following thread (and other past threads):
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -c "truncate 64K"   \
            -c "mmap -w 0 64K"        \
            -c "mwrite -S 0xab 0 64K" \
            -c "munmap"               \
            $MNT/foo
      
        blocks_used=$(stat -c %b $MNT/foo)
        echo "blocks used: $blocks_used"
      
        if [ $blocks_used -eq 0 ]; then
            echo "ERROR: blocks used is 0"
        fi
      
        umount $DEV
      
        $ ./reproducer.sh
        blocks used: 0
        ERROR: blocks used is 0
      
      So move the logic that decides to set the 'delalloc new' bit into the
      function btrfs_set_extent_delalloc(), since that is what we use for all
      those missing cases as well as for the cases that currently work well.
      
      This change is also preparatory work for an upcoming patch that fixes
      other problems related to tracking and reporting the number of bytes used
      by an inode.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3347309
  13. Nov 12, 2020
    • Bob Peterson's avatar
      gfs2: Fix case in which ail writes are done to jdata holes · 4e79e3f0
      Bob Peterson authored
      
      Patch b2a846db ("gfs2: Ignore journal log writes for jdata holes")
      tried (unsuccessfully) to fix a case in which writes were done to jdata
      blocks, the blocks were sent to the ail list, and then a punch_hole or
      truncate operation caused the blocks to be freed. In other words, the
      ail items are for jdata holes. Before b2a846db, the jdata hole caused
      function gfs2_block_map to return -EIO, which was eventually
      interpreted as an IO error to the journal, followed by a withdraw.
      
      This patch changes function gfs2_get_block_noalloc, which is only used
      for jdata writes, so it returns -ENODATA rather than -EIO, and when
      -ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
      We can safely ignore it because gfs2_ail1_start_one is only called
      when the jdata pages have already been written and truncated, so the
      ail1 content no longer applies.
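
The error handling described above can be sketched in two small functions. Both are hypothetical simplifications: the mapper reports a hole as -ENODATA instead of -EIO, and the ail1 step treats only that code as "nothing to do" while passing other errors through.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical non-allocating block mapper used for jdata writes. */
static int demo_get_block_noalloc(bool mapped)
{
    return mapped ? 0 : -ENODATA;
}

/*
 * Hypothetical ail1 step: a hole (-ENODATA) means the block was already
 * written and then freed, so it is ignored.  Other errors are kept.
 */
static int demo_ail1_filter_error(int ret)
{
    return ret == -ENODATA ? 0 : ret;
}
```

Keeping the special case in the ail1 caller, rather than in the generic mapper, leaves other callers with their original error behaviour.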
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      4e79e3f0
    • Bob Peterson's avatar
      Revert "gfs2: Ignore journal log writes for jdata holes" · d3039c06
      Bob Peterson authored
      
      This reverts commit b2a846db.
      
      That commit changed the behavior of function gfs2_block_map to return
      -ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
      false.  While that fixed the intended problem for jdata, it also broke
      other callers of gfs2_block_map such as some jdata block reads.  Before
      the patch, an encountered hole would be skipped and the buffer seen as
      unmapped by the caller.  The patch changed the behavior to return
      -ENODATA, which is interpreted as an error by the caller.
      
      The -ENODATA return code should be restricted to the specific case where
      jdata holes are encountered during ail1 writes.  That will be done in a
      later patch.
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      d3039c06
    • Trond Myklebust's avatar
      NFS: Remove unnecessary inode lock in nfs_fsync_dir() · 11decaf8
      Trond Myklebust authored
      
      nfs_inc_stats() is already thread-safe, and there are no other reasons
      to hold the inode lock here.
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      11decaf8