  1. Dec 07, 2021
    • netfs: fix parameter of cleanup() · 3cfef1b6
      Jeffle Xu authored
      
      The order of these two parameters is reversed. gcc didn't warn about
      that, probably because 'void *' can be converted to or from other
      pointer types without a warning.
      
      Cc: stable@vger.kernel.org
      Fixes: 3d3c9504 ("netfs: Provide readahead and readpage netfs helpers")
      Fixes: e1b1240c ("netfs: Add write_begin helper")
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Link: https://lore.kernel.org/r/20211207031449.100510-1-jefflexu@linux.alibaba.com/ # v1
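      To see why the compiler stays silent, here is a minimal standalone C sketch
      (the struct and names are invented for illustration and are not the actual
      netfs API): because one parameter is 'void *', both conversions at the
      swapped call site are implicit, so gcc accepts the call without a warning.

      #include <stdio.h>

      struct address_space;   /* opaque stand-in */

      /* A cleanup hook shaped roughly like the one the commit fixes:
       * one typed pointer parameter and one 'void *' private pointer. */
      struct demo_ops {
              void (*cleanup)(struct address_space *mapping, void *netfs_priv);
      };

      static void demo_cleanup(struct address_space *mapping, void *netfs_priv)
      {
              printf("mapping=%p priv=%p\n", (void *)mapping, netfs_priv);
      }

      int main(void)
      {
              struct demo_ops ops = { .cleanup = demo_cleanup };
              struct address_space *mapping = NULL;
              void *netfs_priv = &ops;

              ops.cleanup(mapping, netfs_priv);   /* correct order */
              ops.cleanup(netfs_priv, mapping);   /* swapped order: still no warning,
                                                   * 'void *' converts both ways */
              return 0;
      }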
    • netfs: Fix lockdep warning from taking sb_writers whilst holding mmap_lock · 598ad0bd
      David Howells authored
      
      Taking sb_writers whilst holding mmap_lock isn't allowed and will result in
      a lockdep warning like that below.  The problem comes from cachefiles
      needing to take the sb_writers lock in order to do a write to the cache,
      but being asked to do this by netfslib called from readpage, readahead or
      write_begin[1].
      
      Fix this by always offloading the write to the cache off to a worker
      thread.  The main thread doesn't need to wait for it, so deadlock can be
      avoided.
      
      This can be tested by running the quick xfstests on something like afs or
      ceph with lockdep enabled.
      
      WARNING: possible circular locking dependency detected
      5.15.0-rc1-build2+ #292 Not tainted
      ------------------------------------------------------
      holetest/65517 is trying to acquire lock:
      ffff88810c81d730 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_fault+0x276/0x7a5
      
      but task is already holding lock:
      ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c
      
      which lock already depends on the new lock.
      
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&mm->mmap_lock#2){++++}-{3:3}:
             validate_chain+0x3c4/0x4a8
             __lock_acquire+0x89d/0x949
             lock_acquire+0x2dc/0x34b
             __might_fault+0x87/0xb1
             strncpy_from_user+0x25/0x18c
             removexattr+0x7c/0xe5
             __do_sys_fremovexattr+0x73/0x96
             do_syscall_64+0x67/0x7a
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #1 (sb_writers#10){.+.+}-{0:0}:
             validate_chain+0x3c4/0x4a8
             __lock_acquire+0x89d/0x949
             lock_acquire+0x2dc/0x34b
             cachefiles_write+0x2b3/0x4bb
             netfs_rreq_do_write_to_cache+0x3b5/0x432
             netfs_readpage+0x2de/0x39d
             filemap_read_page+0x51/0x94
             filemap_get_pages+0x26f/0x413
             filemap_read+0x182/0x427
             new_sync_read+0xf0/0x161
             vfs_read+0x118/0x16e
             ksys_read+0xb8/0x12e
             do_syscall_64+0x67/0x7a
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #0 (mapping.invalidate_lock#3){.+.+}-{3:3}:
             check_noncircular+0xe4/0x129
             check_prev_add+0x16b/0x3a4
             validate_chain+0x3c4/0x4a8
             __lock_acquire+0x89d/0x949
             lock_acquire+0x2dc/0x34b
             down_read+0x40/0x4a
             filemap_fault+0x276/0x7a5
             __do_fault+0x96/0xbf
             do_fault+0x262/0x35a
             __handle_mm_fault+0x171/0x1b5
             handle_mm_fault+0x12a/0x233
             do_user_addr_fault+0x3d2/0x59c
             exc_page_fault+0x85/0xa5
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        mapping.invalidate_lock#3 --> sb_writers#10 --> &mm->mmap_lock#2
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&mm->mmap_lock#2);
                                     lock(sb_writers#10);
                                     lock(&mm->mmap_lock#2);
        lock(mapping.invalidate_lock#3);
      
       *** DEADLOCK ***
      
      1 lock held by holetest/65517:
       #0: ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c
      
      stack backtrace:
      CPU: 0 PID: 65517 Comm: holetest Not tainted 5.15.0-rc1-build2+ #292
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      Call Trace:
       dump_stack_lvl+0x45/0x59
       check_noncircular+0xe4/0x129
       ? print_circular_bug+0x207/0x207
       ? validate_chain+0x461/0x4a8
       ? add_chain_block+0x88/0xd9
       ? hlist_add_head_rcu+0x49/0x53
       check_prev_add+0x16b/0x3a4
       validate_chain+0x3c4/0x4a8
       ? check_prev_add+0x3a4/0x3a4
       ? mark_lock+0xa5/0x1c6
       __lock_acquire+0x89d/0x949
       lock_acquire+0x2dc/0x34b
       ? filemap_fault+0x276/0x7a5
       ? rcu_read_unlock+0x59/0x59
       ? add_to_page_cache_lru+0x13c/0x13c
       ? lock_is_held_type+0x7b/0xd3
       down_read+0x40/0x4a
       ? filemap_fault+0x276/0x7a5
       filemap_fault+0x276/0x7a5
       ? pagecache_get_page+0x2dd/0x2dd
       ? __lock_acquire+0x8bc/0x949
       ? pte_offset_kernel.isra.0+0x6d/0xc3
       __do_fault+0x96/0xbf
       ? do_fault+0x124/0x35a
       do_fault+0x262/0x35a
       ? handle_pte_fault+0x1c1/0x20d
       __handle_mm_fault+0x171/0x1b5
       ? handle_pte_fault+0x20d/0x20d
       ? __lock_release+0x151/0x254
       ? mark_held_locks+0x1f/0x78
       ? rcu_read_unlock+0x3a/0x59
       handle_mm_fault+0x12a/0x233
       do_user_addr_fault+0x3d2/0x59c
       ? pgtable_bad+0x70/0x70
       ? rcu_read_lock_bh_held+0xab/0xab
       exc_page_fault+0x85/0xa5
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x40192f
      Code: ff 48 89 c3 48 8b 05 50 28 00 00 48 85 ed 7e 23 31 d2 4b 8d 0c 2f eb 0a 0f 1f 00 48 8b 05 39 28 00 00 48 0f af c2 48 83 c2 01 <48> 89 1c 01 48 39 d5 7f e8 8b 0d f2 27 00 00 31 c0 85 c9 74 0e 8b
      RSP: 002b:00007f9931867eb0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: 00007f9931868700 RCX: 00007f993206ac00
      RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00007ffc13e06ee0
      RBP: 0000000000000100 R08: 0000000000000000 R09: 00007f9931868700
      R10: 00007f99318689d0 R11: 0000000000000202 R12: 00007ffc13e06ee0
      R13: 0000000000000c00 R14: 00007ffc13e06e00 R15: 00007f993206a000
      
      Fixes: 726218fd ("netfs: Define an interface to talk to a cache")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Jeff Layton <jlayton@kernel.org>
      cc: Jan Kara <jack@suse.cz>
      cc: linux-cachefs@redhat.com
      cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20210922110420.GA21576@quack2.suse.cz/ [1]
      Link: https://lore.kernel.org/r/163887597541.1596626.2668163316598972956.stgit@warthog.procyon.org.uk/ # v1
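      As a rough sketch of the offload pattern described in the commit above (the
      names here are made up; this is not the actual netfs/cachefiles code), the
      write to the cache is packaged up and queued to a workqueue so that the
      reading thread never takes sb_writers itself:

      #include <linux/workqueue.h>
      #include <linux/slab.h>

      /* Hypothetical container for everything the deferred cache write needs. */
      struct demo_cache_write {
              struct work_struct work;
              void *data;
              size_t len;
      };

      static void demo_write_to_cache_worker(struct work_struct *work)
      {
              struct demo_cache_write *req =
                      container_of(work, struct demo_cache_write, work);

              /* Runs on a workqueue thread, so taking sb_writers here can no
               * longer nest inside the reader's mmap_lock. */
              /* ... perform the actual write to the cache file ... */
              kfree(req);
      }

      /* Called from the readpage/readahead path: queue instead of writing. */
      static void demo_queue_cache_write(void *data, size_t len)
      {
              struct demo_cache_write *req = kzalloc(sizeof(*req), GFP_KERNEL);

              if (!req)
                      return; /* writing to the cache is best-effort */
              req->data = data;
              req->len = len;
              INIT_WORK(&req->work, demo_write_to_cache_worker);
              queue_work(system_unbound_wq, &req->work);
      }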
  2. Dec 02, 2021
    • gfs2: gfs2_create_inode rework · 3d36e57f
      Andreas Gruenbacher authored
      
      When gfs2_lookup_by_inum() calls gfs2_inode_lookup() for an uncached
      inode, gfs2_inode_lookup() will place a new tentative inode into the
      inode cache before verifying that there is a valid inode at the given
      address.  This can race with gfs2_create_inode() which doesn't check for
      duplicate inodes.  gfs2_create_inode() will try to assign the new inode
      to the corresponding inode glock, and glock_set_object() will complain
      that the glock is still in use by gfs2_inode_lookup's tentative inode.
      
      We noticed this bug after adding commit 486408d6 ("gfs2: Cancel
      remote delete work asynchronously") which allowed delete_work_func() to
      race with gfs2_create_inode(), but the same race exists for
      open-by-handle.
      
      Fix that by switching from insert_inode_hash() to
      insert_inode_locked4(), which does check for duplicate inodes.  We know
      we've just managed to allocate the new inode, so an inode tentatively
      created by gfs2_inode_lookup() will eventually go away and
      insert_inode_locked4() will always succeed.
      
      In addition, don't flush the inode glock work anymore (this can now only
      make things worse) and clean up glock_{set,clear}_object for the inode
      glock somewhat.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
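      A schematic of the duplicate-checking insertion described above; the helper,
      struct and field names are hypothetical, and only insert_inode_locked4() is
      the real kernel API:

      #include <linux/fs.h>

      /* Hypothetical per-filesystem inode, standing in for gfs2's struct gfs2_inode. */
      struct demo_inode {
              struct inode i_inode;
              u64 i_no_addr;          /* on-disk inode address */
      };

      static inline struct demo_inode *DEMO_I(struct inode *inode)
      {
              return container_of(inode, struct demo_inode, i_inode);
      }

      /* Match callback for insert_inode_locked4(): same on-disk address? */
      static int demo_iget_test(struct inode *inode, void *opaque)
      {
              u64 *no_addr = opaque;

              return DEMO_I(inode)->i_no_addr == *no_addr;
      }

      static int demo_insert_new_inode(struct inode *inode, u64 no_addr)
      {
              /* Unlike insert_inode_hash(), insert_inode_locked4() checks the
               * inode hash for another inode with the same identity and fails
               * with -EBUSY instead of silently adding a duplicate. */
              return insert_inode_locked4(inode, (unsigned long)no_addr,
                                          demo_iget_test, &no_addr);
      }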
    • gfs2: gfs2_inode_lookup rework · 5f6e13ba
      Andreas Gruenbacher authored
      
      Rework gfs2_inode_lookup() to only set up the new inode's glocks after
      verifying that the new inode is valid.
      
      There is no need to flush the inode glock work queue anymore, so
      remove that as well.
      
      While at it, get rid of the useless wrapper around iget5_locked() and
      its unnecessary is_bad_inode() check.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: gfs2_inode_lookup cleanup · b8e12e35
      Andreas Gruenbacher authored
      
      In gfs2_inode_lookup, once the inode has been looked up, we check if the
      inode generation (no_formal_ino) is the one we're looking for.  If it
      isn't and the inode wasn't in the inode cache, we discard the newly
      looked up inode.  This is unnecessary, complicates the code, and makes
      future changes to gfs2_inode_lookup harder, so change the code to retain
      newly looked up inodes instead.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Fix remote demote of weak glock holders · e11b02df
      Andreas Gruenbacher authored
      
      When we mock up a temporary holder in gfs2_glock_cb to demote weak holders in
      response to a remote locking conflict, we don't set the HIF_HOLDER flag.  This
      causes may_grant() to BUG.  Fix this by setting the missing HIF_HOLDER flag
      in the mock glock holder.
      
      In addition, define the mock glock holder where it is used.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • xfs: remove incorrect ASSERT in xfs_rename · e4459765
      Eric Sandeen authored
      
      This ASSERT in xfs_rename is a) incorrect, because
      (RENAME_WHITEOUT|RENAME_NOREPLACE) is a valid combination, and
      b) unnecessary, because actual invalid flag combinations are already
      handled at the vfs level in do_renameat2() before we get called.
      So, remove it.
      
      Reported-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  3. Nov 27, 2021
    • fs: ntfs: Limit NTFS_RW to page sizes smaller than 64k · 4eec7faf
      Guenter Roeck authored
      
      NTFS_RW code allocates page size dependent arrays on the stack. This
      results in build failures if the page size is 64k or larger.
      
        fs/ntfs/aops.c: In function 'ntfs_write_mst_block':
        fs/ntfs/aops.c:1311:1: error:
      	the frame size of 2240 bytes is larger than 2048 bytes
      
      Since commit f22969a6 ("powerpc/64s: Default to 64K pages for 64 bit
      book3s") this affects ppc:allmodconfig builds, but other architectures
      supporting page sizes of 64k or larger are also affected.
      
      Increasing the maximum frame size for affected architectures just to
      silence this error does not really help.  The frame size would have to
      be set to a really large value for 256k pages.  Also, a large frame size
      could potentially result in stack overruns in this code and elsewhere
      and is therefore not desirable.  Make NTFS_RW dependent on page sizes
      smaller than 64k instead.
      
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
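      As a simplified illustration of the page-size-dependent stack usage involved
      (a sketch, not the actual ntfs_write_mst_block() code): MAX_BUF_PER_PAGE is
      PAGE_SIZE / 512, so an on-stack array of buffer_head pointers grows from
      64 bytes with 4K pages to 1 KiB with 64K pages and 4 KiB with 256K pages,
      which is what pushes the frame past the 2048-byte limit.

      #include <linux/buffer_head.h>

      static void demo_collect_page_buffers(struct page *page)
      {
              /* PAGE_SIZE / 512 entries: 8 with 4K pages, 128 with 64K pages. */
              struct buffer_head *bhs[MAX_BUF_PER_PAGE];
              struct buffer_head *bh, *head;
              unsigned int nr = 0;

              bh = head = page_buffers(page);
              do {
                      bhs[nr++] = bh;
              } while ((bh = bh->b_this_page) != head);

              /* ... operate on the collected buffers ... */
              (void)nr;
      }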
    • io_uring: Fix undefined-behaviour in io_issue_sqe · f6223ff7
      Ye Bin authored
      
      We hit the following issue:
      ================================================================================
      UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14
      signed integer overflow:
      -4966321760114568020 * 1000000000 cannot be represented in type 'long long int'
      CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
       show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x170/0x1dc lib/dump_stack.c:118
       ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
       handle_overflow+0x188/0x1dc lib/ubsan.c:192
       __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
       ktime_set include/linux/ktime.h:42 [inline]
       timespec64_to_ktime include/linux/ktime.h:78 [inline]
       io_timeout fs/io_uring.c:5153 [inline]
       io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599
       __io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988
       io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067
       io_submit_sqe fs/io_uring.c:6137 [inline]
       io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331
       __do_sys_io_uring_enter fs/io_uring.c:8170 [inline]
       __se_sys_io_uring_enter fs/io_uring.c:8129 [inline]
       __arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129
       invoke_syscall arch/arm64/kernel/syscall.c:53 [inline]
       el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121
       el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190
       el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017
      ================================================================================
      
      ktime_set() only checks whether 'secs' is larger than KTIME_SEC_MAX, so
      passing a negative value can lead to a signed overflow in the conversion.
      To address this issue, we must check whether 'sec' is negative.
      
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
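      For context, ktime_set() (include/linux/ktime.h) only clamps the upper bound:
      it returns KTIME_MAX when secs >= KTIME_SEC_MAX, but a negative secs value
      still reaches the 'secs * NSEC_PER_SEC' multiplication, which is the signed
      overflow UBSAN reports above.  A sketch of the kind of guard the commit
      describes (hypothetical helper, not the literal io_uring hunk):

      #include <linux/ktime.h>
      #include <linux/errno.h>

      /* Validate a user-supplied timespec64 before converting it to ktime_t. */
      static int demo_timespec64_to_ktime(const struct timespec64 *ts, ktime_t *out)
      {
              /* Reject negative timeouts: ktime_set() does not check for them. */
              if (ts->tv_sec < 0 || ts->tv_nsec < 0)
                      return -EINVAL;

              *out = timespec64_to_ktime(*ts);
              return 0;
      }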
    • io_uring: fix soft lockup when call __io_remove_buffers · 1d0254e6
      Ye Bin authored
      
      I got the following issue:
      [ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
      [  594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
      [  594.364987] Modules linked in:
      [  594.365405] irq event stamp: 604180238
      [  594.365906] hardirqs last  enabled at (604180237): [<ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50
      [  594.367181] hardirqs last disabled at (604180238): [<ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0
      [  594.368420] softirqs last  enabled at (569080666): [<ffffffff94200654>] __do_softirq+0x654/0xa9e
      [  594.369551] softirqs last disabled at (569080575): [<ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250
      [  594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G            L    5.15.0-next-20211112+ #88
      [  594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      [  594.373604] Workqueue: events_unbound io_ring_exit_work
      [  594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
      [  594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474
      [  594.377433] RSP: 0018:ffff888101587a70 EFLAGS: 00000202
      [  594.378120] RAX: 0000000024030f0d RBX: 0000000000000246 RCX: 1ffffffff2f09106
      [  594.379053] RDX: 0000000000000000 RSI: ffffffff9449f0e0 RDI: 0000000000000001
      [  594.379991] RBP: ffffffff9586cdc0 R08: 0000000000000001 R09: fffffbfff2effcab
      [  594.380923] R10: ffffffff977fe557 R11: fffffbfff2effcaa R12: ffff8881b8f3def0
      [  594.381858] R13: 0000000000000246 R14: ffff888153a8b070 R15: 0000000000000000
      [  594.382787] FS:  0000000000000000(0000) GS:ffff888399c00000(0000) knlGS:0000000000000000
      [  594.383851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  594.384602] CR2: 00007fcbe71d2000 CR3: 00000000b4216000 CR4: 00000000000006e0
      [  594.385540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  594.386474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  594.387403] Call Trace:
      [  594.387738]  <TASK>
      [  594.388042]  find_and_remove_object+0x118/0x160
      [  594.389321]  delete_object_full+0xc/0x20
      [  594.389852]  kfree+0x193/0x470
      [  594.390275]  __io_remove_buffers.part.0+0xed/0x147
      [  594.390931]  io_ring_ctx_free+0x342/0x6a2
      [  594.392159]  io_ring_exit_work+0x41e/0x486
      [  594.396419]  process_one_work+0x906/0x15a0
      [  594.399185]  worker_thread+0x8b/0xd80
      [  594.400259]  kthread+0x3bf/0x4a0
      [  594.401847]  ret_from_fork+0x22/0x30
      [  594.402343]  </TASK>
      
      Message from syslogd@localhost at Nov 13 09:09:54 ...
      kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
      [  596.793660] __io_remove_buffers: [2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
      
      We can reproduce this issue with the following syzkaller log:
      r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0)
      sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0)
      syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0)
      io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0)
      
      The reason for the above issue is that 'buf->list' has 2,100,000 nodes;
      freeing them occupies the CPU long enough to cause a soft lockup.
      To solve this issue, add a scheduling point to the loop in
      '__io_remove_buffers'.
      After adding the scheduling point, a regression run produced the following data:
      [  240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
      [  268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
      [  275.899234] __io_remove_buffers: [2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
      [  296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
      [  305.090059] __io_remove_buffers: [2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
      [  325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00
      [  333.160318] __io_remove_buffers: [2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
      ...
      
      Fixes: 8bab4c09 ("io_uring: allow conditional reschedule for intensive iterators")
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
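      The scheduling point amounts to a cond_resched() call inside the
      long-running loop.  A minimal sketch of the pattern (a generic list
      teardown, not the literal __io_remove_buffers() code):

      #include <linux/list.h>
      #include <linux/sched.h>
      #include <linux/slab.h>

      struct demo_buf {
              struct list_head list;
      };

      /* Freeing a list with ~2 million entries without ever yielding keeps the
       * CPU busy long enough to trip the soft-lockup watchdog; cond_resched()
       * inside the loop gives the scheduler a chance to run other work. */
      static void demo_remove_buffers(struct list_head *buffers)
      {
              while (!list_empty(buffers)) {
                      struct demo_buf *buf;

                      buf = list_first_entry(buffers, struct demo_buf, list);
                      list_del(&buf->list);
                      kfree(buf);
                      cond_resched();         /* the added scheduling point */
              }
      }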
  4. Nov 24, 2021
    • iomap: iomap_read_inline_data cleanup · 5ad448ce
      Andreas Gruenbacher authored
      
      Change iomap_read_inline_data to return 0 or an error code; this
      simplifies the callers.  Add a description.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [djwong: document the return value of iomap_read_inline_data explicitly]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove xfs_inew_wait · 1090427b
      Christoph Hellwig authored
      
      With the removal of xfs_dqrele_all_inodes, xfs_inew_wait and all the
      infrastructure used to wake the XFS_INEW bit waitqueue are unused.
      
      Reported-by: kernel test robot <lkp@intel.com>
      Fixes: 777eb1fa ("xfs: remove xfs_dqrele_all_inodes")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Fix the free logic of state in xfs_attr_node_hasname · a1de97fe
      Yang Xu authored
      
      When testing xfstests xfs/126 on the latest upstream kernel, it hangs on some machines.
      By adding a getxattr operation after the xattr is corrupted, I can reproduce it 100% of the time.
      
      The deadlock as below:
      [983.923403] task:setfattr        state:D stack:    0 pid:17639 ppid: 14687 flags:0x00000080
      [  983.923405] Call Trace:
      [  983.923410]  __schedule+0x2c4/0x700
      [  983.923412]  schedule+0x37/0xa0
      [  983.923414]  schedule_timeout+0x274/0x300
      [  983.923416]  __down+0x9b/0xf0
      [  983.923451]  ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
      [  983.923453]  down+0x3b/0x50
      [  983.923471]  xfs_buf_lock+0x33/0xf0 [xfs]
      [  983.923490]  xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
      [  983.923508]  xfs_buf_get_map+0x4c/0x320 [xfs]
      [  983.923525]  xfs_buf_read_map+0x53/0x310 [xfs]
      [  983.923541]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
      [  983.923560]  xfs_trans_read_buf_map+0x1cf/0x360 [xfs]
      [  983.923575]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
      [  983.923590]  xfs_da_read_buf+0xcf/0x120 [xfs]
      [  983.923606]  xfs_da3_node_read+0x1f/0x40 [xfs]
      [  983.923621]  xfs_da3_node_lookup_int+0x69/0x4a0 [xfs]
      [  983.923624]  ? kmem_cache_alloc+0x12e/0x270
      [  983.923637]  xfs_attr_node_hasname+0x6e/0xa0 [xfs]
      [  983.923651]  xfs_has_attr+0x6e/0xd0 [xfs]
      [  983.923664]  xfs_attr_set+0x273/0x320 [xfs]
      [  983.923683]  xfs_xattr_set+0x87/0xd0 [xfs]
      [  983.923686]  __vfs_removexattr+0x4d/0x60
      [  983.923688]  __vfs_removexattr_locked+0xac/0x130
      [  983.923689]  vfs_removexattr+0x4e/0xf0
      [  983.923690]  removexattr+0x4d/0x80
      [  983.923693]  ? __check_object_size+0xa8/0x16b
      [  983.923695]  ? strncpy_from_user+0x47/0x1a0
      [  983.923696]  ? getname_flags+0x6a/0x1e0
      [  983.923697]  ? _cond_resched+0x15/0x30
      [  983.923699]  ? __sb_start_write+0x1e/0x70
      [  983.923700]  ? mnt_want_write+0x28/0x50
      [  983.923701]  path_removexattr+0x9b/0xb0
      [  983.923702]  __x64_sys_removexattr+0x17/0x20
      [  983.923704]  do_syscall_64+0x5b/0x1a0
      [  983.923705]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [  983.923707] RIP: 0033:0x7f080f10ee1b
      
      When getxattr calls the xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in
      xfs_attr_node_hasname because xfs/126 has randomly corrupted the blocks with blocktrash.  So
      xfs_attr_node_hasname frees the state internally, and xfs_attr_node_get doesn't do its xfs_buf_trans release job.

      The subsequent removexattr then hangs because of it.

      This bug was introduced by kernel commit 07120f1a ("xfs: Add xfs_has_attr and subroutines").
      It adds the xfs_attr_node_hasname helper and says the caller will be responsible for freeing the state
      in this case.  But xfs_attr_node_hasname will free the state itself instead of leaving that to the caller if
      xfs_da3_node_lookup_int fails.

      Fix this bug by moving the freeing of the state into the callers.

      Also, use "goto error/out" instead of returning the error directly in xfs_attr_node_addname_find_attr and
      xfs_attr_node_removename_setup, because we should free the state ourselves.
      
      Fixes: 07120f1a ("xfs: Add xfs_has_attr and subroutines")
      Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
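      Schematically, the ownership rule after the fix looks like this (a sketch
      with a simplified caller, not the verbatim patch): the caller that requested
      the lookup frees the xfs_da_state on the error path, instead of
      xfs_attr_node_hasname() freeing it behind the caller's back.

      /* Sketch of a caller inside fs/xfs/xfs_attr.c (the xfs headers are
       * already included there). */
      static int demo_removename_setup(struct xfs_da_args *args,
                                       struct xfs_da_state **statep)
      {
              struct xfs_da_state *state = NULL;
              int error;

              error = xfs_attr_node_hasname(args, &state);
              if (error != -EEXIST)
                      goto out_free;  /* not found or corrupted: free state here */

              /* ... validate the found entry, hand state back to the caller ... */
              *statep = state;
              return 0;
      out_free:
              if (state)
                      xfs_da_state_free(state);
              return error;
      }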
  5. Nov 22, 2021
    • iomap: Fix inline extent handling in iomap_readpage · d8af404f
      Andreas Gruenbacher authored
      
      Before commit 740499c7 ("iomap: fix the iomap_readpage_actor return
      value for inline data"), when hitting an IOMAP_INLINE extent,
      iomap_readpage_actor would report having read the entire page.  Since
      then, it only reports having read the inline data (iomap->length).
      
      This will force iomap_readpage into another iteration, and the
      filesystem will report an unaligned hole after the IOMAP_INLINE extent.
      But iomap_readpage_actor (now iomap_readpage_iter) isn't prepared to
      deal with unaligned extents; it will get things wrong on filesystems
      with a block size smaller than the page size, and we'll eventually run
      into the following warning in iomap_iter_advance:
      
        WARN_ON_ONCE(iter->processed > iomap_length(iter));
      
      Fix that by changing iomap_readpage_iter to return 0 when hitting an
      inline extent; this will cause iomap_iter to stop immediately.
      
      To fix readahead as well, change iomap_readahead_iter to pass on
      iomap_readpage_iter return values less than or equal to zero.
      
      Fixes: 740499c7 ("iomap: fix the iomap_readpage_actor return value for inline data")
      Cc: stable@vger.kernel.org # v5.15+
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
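      Schematically, the readahead side of the fix behaves as below (a sketch of
      the loop semantics with hypothetical names, not the verbatim iomap patch):
      a return value of 0 from the per-page iterator (the inline-extent case) or
      a negative error is passed straight back, ending the iteration, instead of
      being treated as forward progress.

      #include <linux/types.h>

      struct demo_readpage_ctx;

      /* The per-page iterator returns > 0 bytes processed, 0 when it hit an
       * inline extent (stop immediately), or a negative error. */
      typedef loff_t (*demo_readpage_iter_fn)(struct demo_readpage_ctx *ctx,
                                              loff_t offset);

      static loff_t demo_readahead_iter(loff_t length, struct demo_readpage_ctx *ctx,
                                        demo_readpage_iter_fn readpage_iter)
      {
              loff_t done, ret;

              for (done = 0; done < length; done += ret) {
                      ret = readpage_iter(ctx, done);
                      if (ret <= 0)           /* pass 0 and errors straight back */
                              return ret;
              }
              return done;
      }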
  6. Nov 20, 2021
    • proc/vmcore: fix clearing user buffer by properly using clear_user() · c1e63117
      David Hildenbrand authored

      To clear a user buffer we cannot simply use memset(); we have to use
      clear_user().  With a virtio-mem device that registers a vmcore_cb and
      has some logically unplugged memory inside an added Linux memory block,
      I can easily trigger a BUG by copying the vmcore via "cp":
      
        systemd[1]: Starting Kdump Vmcore Save Service...
        kdump[420]: Kdump is using the default log level(3).
        kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
        kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
        kdump[465]: saving vmcore-dmesg.txt complete
        kdump[467]: saving vmcore
        BUG: unable to handle page fault for address: 00007f2374e01000
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0003) - permissions violation
        PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
        Oops: 0003 [#1] PREEMPT SMP NOPTI
        CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
        RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
        Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
        RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
        RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
        RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
        RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
        R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
        R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
        FS:  00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
        Call Trace:
         read_vmcore+0x236/0x2c0
         proc_reg_read+0x55/0xa0
         vfs_read+0x95/0x190
         ksys_read+0x4f/0xc0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
      Prevention (SMAP)", which is used to detect wrong access from the kernel
      to user buffers like this: SMAP triggers a permissions violation on
      wrong access.  In the x86-64 variant of clear_user(), SMAP is properly
      handled via clac()+stac().
      
      To fix, properly use clear_user() when we're dealing with a user buffer.
      
      Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
      
      
      Fixes: 997c136f ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Philipp Rudo <prudo@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
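      The fix boils down to picking the right zeroing primitive for the
      destination buffer.  A simplified sketch of the pattern (not the exact
      read_from_oldmem() hunk):

      #include <linux/uaccess.h>
      #include <linux/string.h>
      #include <linux/errno.h>

      /* Zero @count bytes of the destination.  A kernel buffer can be cleared
       * with memset(), but a user buffer must go through clear_user(), which
       * uses the proper uaccess machinery (on x86-64 it toggles SMAP with
       * stac()/clac()) instead of writing to user memory directly. */
      static int demo_zero_dest(char *buf, size_t count, bool userbuf)
      {
              if (userbuf) {
                      if (clear_user((char __user *)buf, count))
                              return -EFAULT;
              } else {
                      memset(buf, 0, count);
              }
              return 0;
      }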