1. 15 Jan, 2019 1 commit
    • Ming Lei's avatar
      sbitmap: Protect swap_lock from hardirq · fe76fc6a
      Ming Lei authored
      Because we may call blk_mq_get_driver_tag() directly from
      blk_mq_dispatch_rq_list() without holding any lock, then HARDIRQ may
      come and the above DEADLOCK is triggered.
      
      Commit ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq") tries to
      fix this issue by using 'spin_lock_bh', which isn't enough because we
      complete request from hardirq context direclty in case of multiqueue.
      
      Cc: Clark Williams <williams@redhat.com>
      Fixes: ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq")
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fe76fc6a
  2. 14 Jan, 2019 1 commit
  3. 20 Dec, 2018 1 commit
  4. 12 Dec, 2018 1 commit
  5. 10 Dec, 2018 1 commit
    • Jens Axboe's avatar
      sbitmap: silence bogus lockdep IRQ warning · 58ab5e32
      Jens Axboe authored
      Ming reports that lockdep spews the following trace. What this
      essentially says is that the sbitmap swap_lock was used inconsistently
      in IRQ enabled and disabled context, and that is usually indicative of a
      bug that will cause a deadlock.
      
      For this case, it's a false positive. The swap_lock is used from process
      context only, when we swap the bits in the word and cleared mask. We
      also end up doing that when we are getting a driver tag, from the
      blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
      IRQs disabled. However, this isn't from an actual IRQ, it's still
      process context.
      
      In lieu of a better way to fix this, simply always disable interrupts
      when grabbing the swap_lock if lockdep is enabled.
      
      [  100.967642] ================start test sanity/001================
      [  101.238280] null: module loaded
      [  106.093735]
      [  106.094012] =====================================================
      [  106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
      [  106.095759] 4.20.0-rc3_5d2ee712_for-next+ #1 Not tainted
      [  106.096551] -----------------------------------------------------
      [  106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [  106.098231] 000000004c43fa71
      (&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
      [  106.099431]
      [  106.099431] and this task is already holding:
      [  106.100229] 000000007eec8b2f
      (&(&hctx->dispatch_wait_lock)->rlock){....}, at:
      blk_mq_dispatch_rq_list+0x4c1/0xd7c
      [  106.101630] which would create a new lock dependency:
      [  106.102326]  (&(&hctx->dispatch_wait_lock)->rlock){....} ->
      (&(&sb->map[i].swap_lock)->rlock){+.+.}
      [  106.103553]
      [  106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
      [  106.104580]  (&sbq->ws[i].wait){..-.}
      [  106.104582]
      [  106.104582] ... which became SOFTIRQ-irq-safe at:
      [  106.105751]   _raw_spin_lock_irqsave+0x4b/0x82
      [  106.106284]   __wake_up_common_lock+0x119/0x1b9
      [  106.106825]   sbitmap_queue_wake_up+0x33f/0x383
      [  106.107456]   sbitmap_queue_clear+0x4c/0x9a
      [  106.108046]   __blk_mq_free_request+0x188/0x1d3
      [  106.108581]   blk_mq_free_request+0x23b/0x26b
      [  106.109102]   scsi_end_request+0x345/0x5d7
      [  106.109587]   scsi_io_completion+0x4b5/0x8f0
      [  106.110099]   scsi_finish_command+0x412/0x456
      [  106.110615]   scsi_softirq_done+0x23f/0x29b
      [  106.111115]   blk_done_softirq+0x2a7/0x2e6
      [  106.111608]   __do_softirq+0x360/0x6ad
      [  106.112062]   run_ksoftirqd+0x2f/0x5b
      [  106.112499]   smpboot_thread_fn+0x3a5/0x3db
      [  106.113000]   kthread+0x1d4/0x1e4
      [  106.113457]   ret_from_fork+0x3a/0x50
      [  106.113969]
      [  106.113969] to a SOFTIRQ-irq-unsafe lock:
      [  106.114672]  (&(&sb->map[i].swap_lock)->rlock){+.+.}
      [  106.114674]
      [  106.114674] ... which became SOFTIRQ-irq-unsafe at:
      [  106.116000] ...
      [  106.116003]   _raw_spin_lock+0x33/0x64
      [  106.116676]   sbitmap_get+0xd5/0x22c
      [  106.117134]   __sbitmap_queue_get+0xe8/0x177
      [  106.117731]   __blk_mq_get_tag+0x1e6/0x22d
      [  106.118286]   blk_mq_get_tag+0x1db/0x6e4
      [  106.118756]   blk_mq_get_driver_tag+0x161/0x258
      [  106.119383]   blk_mq_dispatch_rq_list+0x28e/0xd7c
      [  106.120043]   blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.120607]   blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.121234]   __blk_mq_run_hw_queue+0x137/0x17e
      [  106.121781]   __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.122366]   blk_mq_run_hw_queue+0x151/0x187
      [  106.122887]   blk_mq_sched_insert_requests+0x13f/0x175
      [  106.123492]   blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.124042]   blk_flush_plug_list+0x392/0x3d7
      [  106.124557]   blk_finish_plug+0x37/0x4f
      [  106.125019]   read_pages+0x3ef/0x430
      [  106.125446]   __do_page_cache_readahead+0x18e/0x2fc
      [  106.126027]   force_page_cache_readahead+0x121/0x133
      [  106.126621]   page_cache_sync_readahead+0x35f/0x3bb
      [  106.127229]   generic_file_buffered_read+0x410/0x1860
      [  106.127932]   __vfs_read+0x319/0x38f
      [  106.128415]   vfs_read+0xd2/0x19a
      [  106.128817]   ksys_read+0xb9/0x135
      [  106.129225]   do_syscall_64+0x140/0x385
      [  106.129684]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.130292]
      [  106.130292] other info that might help us debug this:
      [  106.130292]
      [  106.131226] Chain exists of:
      [  106.131226]   &sbq->ws[i].wait -->
      &(&hctx->dispatch_wait_lock)->rlock -->
      &(&sb->map[i].swap_lock)->rlock
      [  106.131226]
      [  106.132865]  Possible interrupt unsafe locking scenario:
      [  106.132865]
      [  106.133659]        CPU0                    CPU1
      [  106.134194]        ----                    ----
      [  106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
      [  106.135318]                                local_irq_disable();
      [  106.136014]                                lock(&sbq->ws[i].wait);
      [  106.136747]
      lock(&(&hctx->dispatch_wait_lock)->rlock);
      [  106.137742]   <Interrupt>
      [  106.138110]     lock(&sbq->ws[i].wait);
      [  106.138625]
      [  106.138625]  *** DEADLOCK ***
      [  106.138625]
      [  106.139430] 3 locks held by fio/1043:
      [  106.139947]  #0: 0000000076ff0fd9 (rcu_read_lock){....}, at:
      hctx_lock+0x29/0xe8
      [  106.140813]  #1: 000000002feb1016 (&sbq->ws[i].wait){..-.}, at:
      blk_mq_dispatch_rq_list+0x4ad/0xd7c
      [  106.141877]  #2: 000000007eec8b2f
      (&(&hctx->dispatch_wait_lock)->rlock){....}, at:
      blk_mq_dispatch_rq_list+0x4c1/0xd7c
      [  106.143267]
      [  106.143267] the dependencies between SOFTIRQ-irq-safe lock and the
      holding lock:
      [  106.144351]  -> (&sbq->ws[i].wait){..-.} ops: 82 {
      [  106.144926]     IN-SOFTIRQ-W at:
      [  106.145314]                       _raw_spin_lock_irqsave+0x4b/0x82
      [  106.146042]                       __wake_up_common_lock+0x119/0x1b9
      [  106.146785]                       sbitmap_queue_wake_up+0x33f/0x383
      [  106.147567]                       sbitmap_queue_clear+0x4c/0x9a
      [  106.148379]                       __blk_mq_free_request+0x188/0x1d3
      [  106.149148]                       blk_mq_free_request+0x23b/0x26b
      [  106.149864]                       scsi_end_request+0x345/0x5d7
      [  106.150546]                       scsi_io_completion+0x4b5/0x8f0
      [  106.151367]                       scsi_finish_command+0x412/0x456
      [  106.152157]                       scsi_softirq_done+0x23f/0x29b
      [  106.152855]                       blk_done_softirq+0x2a7/0x2e6
      [  106.153537]                       __do_softirq+0x360/0x6ad
      [  106.154280]                       run_ksoftirqd+0x2f/0x5b
      [  106.155020]                       smpboot_thread_fn+0x3a5/0x3db
      [  106.155828]                       kthread+0x1d4/0x1e4
      [  106.156526]                       ret_from_fork+0x3a/0x50
      [  106.157267]     INITIAL USE at:
      [  106.157713]                      _raw_spin_lock_irqsave+0x4b/0x82
      [  106.158542]                      prepare_to_wait_exclusive+0xa8/0x215
      [  106.159421]                      blk_mq_get_tag+0x34f/0x6e4
      [  106.160186]                      blk_mq_get_request+0x48e/0xaef
      [  106.160997]                      blk_mq_make_request+0x27e/0xbd2
      [  106.161828]                      generic_make_request+0x4d1/0x873
      [  106.162661]                      submit_bio+0x20c/0x253
      [  106.163379]                      mpage_bio_submit+0x44/0x4b
      [  106.164142]                      mpage_readpages+0x3c2/0x407
      [  106.164919]                      read_pages+0x13a/0x430
      [  106.165633]                      __do_page_cache_readahead+0x18e/0x2fc
      [  106.166530]                      force_page_cache_readahead+0x121/0x133
      [  106.167439]                      page_cache_sync_readahead+0x35f/0x3bb
      [  106.168337]                      generic_file_buffered_read+0x410/0x1860
      [  106.169255]                      __vfs_read+0x319/0x38f
      [  106.169977]                      vfs_read+0xd2/0x19a
      [  106.170662]                      ksys_read+0xb9/0x135
      [  106.171356]                      do_syscall_64+0x140/0x385
      [  106.172120]                      entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.173051]   }
      [  106.173308]   ... key      at: [<ffffffff85094600>] __key.26481+0x0/0x40
      [  106.174219]   ... acquired at:
      [  106.174646]    _raw_spin_lock+0x33/0x64
      [  106.175183]    blk_mq_dispatch_rq_list+0x4c1/0xd7c
      [  106.175843]    blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.176518]    blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.177262]    __blk_mq_run_hw_queue+0x137/0x17e
      [  106.177900]    __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.178591]    blk_mq_run_hw_queue+0x151/0x187
      [  106.179207]    blk_mq_sched_insert_requests+0x13f/0x175
      [  106.179926]    blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.180571]    blk_flush_plug_list+0x392/0x3d7
      [  106.181187]    blk_finish_plug+0x37/0x4f
      [  106.181737]    __se_sys_io_submit+0x171/0x304
      [  106.182346]    do_syscall_64+0x140/0x385
      [  106.182895]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.183607]
      [  106.183830] -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
      [  106.184691]    INITIAL USE at:
      [  106.185119]                    _raw_spin_lock+0x33/0x64
      [  106.185838]                    blk_mq_dispatch_rq_list+0x4c1/0xd7c
      [  106.186697]                    blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.187551]                    blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.188481]                    __blk_mq_run_hw_queue+0x137/0x17e
      [  106.189307]                    __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.190189]                    blk_mq_run_hw_queue+0x151/0x187
      [  106.190989]                    blk_mq_sched_insert_requests+0x13f/0x175
      [  106.191902]                    blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.192739]                    blk_flush_plug_list+0x392/0x3d7
      [  106.193535]                    blk_finish_plug+0x37/0x4f
      [  106.194269]                    __se_sys_io_submit+0x171/0x304
      [  106.195059]                    do_syscall_64+0x140/0x385
      [  106.195794]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.196705]  }
      [  106.196950]  ... key      at: [<ffffffff84880620>] __key.51231+0x0/0x40
      [  106.197853]  ... acquired at:
      [  106.198270]    lock_acquire+0x280/0x2f3
      [  106.198806]    _raw_spin_lock+0x33/0x64
      [  106.199337]    sbitmap_get+0xd5/0x22c
      [  106.199850]    __sbitmap_queue_get+0xe8/0x177
      [  106.200450]    __blk_mq_get_tag+0x1e6/0x22d
      [  106.201035]    blk_mq_get_tag+0x1db/0x6e4
      [  106.201589]    blk_mq_get_driver_tag+0x161/0x258
      [  106.202237]    blk_mq_dispatch_rq_list+0x5b9/0xd7c
      [  106.202902]    blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.203572]    blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.204316]    __blk_mq_run_hw_queue+0x137/0x17e
      [  106.204956]    __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.205649]    blk_mq_run_hw_queue+0x151/0x187
      [  106.206269]    blk_mq_sched_insert_requests+0x13f/0x175
      [  106.206997]    blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.207644]    blk_flush_plug_list+0x392/0x3d7
      [  106.208264]    blk_finish_plug+0x37/0x4f
      [  106.208814]    __se_sys_io_submit+0x171/0x304
      [  106.209415]    do_syscall_64+0x140/0x385
      [  106.209965]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.210684]
      [  106.210904]
      [  106.210904] the dependencies between the lock to be acquired
      [  106.210905]  and SOFTIRQ-irq-unsafe lock:
      [  106.212541] -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 1969 {
      [  106.213393]    HARDIRQ-ON-W at:
      [  106.213840]                     _raw_spin_lock+0x33/0x64
      [  106.214570]                     sbitmap_get+0xd5/0x22c
      [  106.215282]                     __sbitmap_queue_get+0xe8/0x177
      [  106.216086]                     __blk_mq_get_tag+0x1e6/0x22d
      [  106.216876]                     blk_mq_get_tag+0x1db/0x6e4
      [  106.217627]                     blk_mq_get_driver_tag+0x161/0x258
      [  106.218465]                     blk_mq_dispatch_rq_list+0x28e/0xd7c
      [  106.219326]                     blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.220198]                     blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.221138]                     __blk_mq_run_hw_queue+0x137/0x17e
      [  106.221975]                     __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.222874]                     blk_mq_run_hw_queue+0x151/0x187
      [  106.223686]                     blk_mq_sched_insert_requests+0x13f/0x175
      [  106.224597]                     blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.225444]                     blk_flush_plug_list+0x392/0x3d7
      [  106.226255]                     blk_finish_plug+0x37/0x4f
      [  106.227006]                     read_pages+0x3ef/0x430
      [  106.227717]                     __do_page_cache_readahead+0x18e/0x2fc
      [  106.228595]                     force_page_cache_readahead+0x121/0x133
      [  106.229491]                     page_cache_sync_readahead+0x35f/0x3bb
      [  106.230373]                     generic_file_buffered_read+0x410/0x1860
      [  106.231277]                     __vfs_read+0x319/0x38f
      [  106.231986]                     vfs_read+0xd2/0x19a
      [  106.232666]                     ksys_read+0xb9/0x135
      [  106.233350]                     do_syscall_64+0x140/0x385
      [  106.234097]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.235012]    SOFTIRQ-ON-W at:
      [  106.235460]                     _raw_spin_lock+0x33/0x64
      [  106.236195]                     sbitmap_get+0xd5/0x22c
      [  106.236913]                     __sbitmap_queue_get+0xe8/0x177
      [  106.237715]                     __blk_mq_get_tag+0x1e6/0x22d
      [  106.238488]                     blk_mq_get_tag+0x1db/0x6e4
      [  106.239244]                     blk_mq_get_driver_tag+0x161/0x258
      [  106.240079]                     blk_mq_dispatch_rq_list+0x28e/0xd7c
      [  106.240937]                     blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.241806]                     blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.242751]                     __blk_mq_run_hw_queue+0x137/0x17e
      [  106.243579]                     __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.244469]                     blk_mq_run_hw_queue+0x151/0x187
      [  106.245277]                     blk_mq_sched_insert_requests+0x13f/0x175
      [  106.246191]                     blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.247044]                     blk_flush_plug_list+0x392/0x3d7
      [  106.247859]                     blk_finish_plug+0x37/0x4f
      [  106.248749]                     read_pages+0x3ef/0x430
      [  106.249463]                     __do_page_cache_readahead+0x18e/0x2fc
      [  106.250357]                     force_page_cache_readahead+0x121/0x133
      [  106.251263]                     page_cache_sync_readahead+0x35f/0x3bb
      [  106.252157]                     generic_file_buffered_read+0x410/0x1860
      [  106.253084]                     __vfs_read+0x319/0x38f
      [  106.253808]                     vfs_read+0xd2/0x19a
      [  106.254488]                     ksys_read+0xb9/0x135
      [  106.255186]                     do_syscall_64+0x140/0x385
      [  106.255943]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.256867]    INITIAL USE at:
      [  106.257300]                    _raw_spin_lock+0x33/0x64
      [  106.258033]                    sbitmap_get+0xd5/0x22c
      [  106.258747]                    __sbitmap_queue_get+0xe8/0x177
      [  106.259542]                    __blk_mq_get_tag+0x1e6/0x22d
      [  106.260320]                    blk_mq_get_tag+0x1db/0x6e4
      [  106.261072]                    blk_mq_get_driver_tag+0x161/0x258
      [  106.261902]                    blk_mq_dispatch_rq_list+0x28e/0xd7c
      [  106.262762]                    blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.263626]                    blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.264571]                    __blk_mq_run_hw_queue+0x137/0x17e
      [  106.265409]                    __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.266302]                    blk_mq_run_hw_queue+0x151/0x187
      [  106.267111]                    blk_mq_sched_insert_requests+0x13f/0x175
      [  106.268028]                    blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.268878]                    blk_flush_plug_list+0x392/0x3d7
      [  106.269694]                    blk_finish_plug+0x37/0x4f
      [  106.270432]                    read_pages+0x3ef/0x430
      [  106.271139]                    __do_page_cache_readahead+0x18e/0x2fc
      [  106.272040]                    force_page_cache_readahead+0x121/0x133
      [  106.272932]                    page_cache_sync_readahead+0x35f/0x3bb
      [  106.273811]                    generic_file_buffered_read+0x410/0x1860
      [  106.274709]                    __vfs_read+0x319/0x38f
      [  106.275407]                    vfs_read+0xd2/0x19a
      [  106.276074]                    ksys_read+0xb9/0x135
      [  106.276764]                    do_syscall_64+0x140/0x385
      [  106.277500]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  106.278417]  }
      [  106.278676]  ... key      at: [<ffffffff85094640>] __key.26212+0x0/0x40
      [  106.279586]  ... acquired at:
      [  106.280026]    lock_acquire+0x280/0x2f3
      [  106.280559]    _raw_spin_lock+0x33/0x64
      [  106.281101]    sbitmap_get+0xd5/0x22c
      [  106.281610]    __sbitmap_queue_get+0xe8/0x177
      [  106.282221]    __blk_mq_get_tag+0x1e6/0x22d
      [  106.282809]    blk_mq_get_tag+0x1db/0x6e4
      [  106.283368]    blk_mq_get_driver_tag+0x161/0x258
      [  106.284018]    blk_mq_dispatch_rq_list+0x5b9/0xd7c
      [  106.284685]    blk_mq_do_dispatch_sched+0x23a/0x287
      [  106.285371]    blk_mq_sched_dispatch_requests+0x379/0x3fc
      [  106.286135]    __blk_mq_run_hw_queue+0x137/0x17e
      [  106.286806]    __blk_mq_delay_run_hw_queue+0x80/0x25f
      [  106.287515]    blk_mq_run_hw_queue+0x151/0x187
      [  106.288149]    blk_mq_sched_insert_requests+0x13f/0x175
      [  106.289041]    blk_mq_flush_plug_list+0x7d6/0x81b
      [  106.289912]    blk_flush_plug_list+0x392/0x3d7
      [  106.290590]    blk_finish_plug+0x37/0x4f
      [  106.291238]    __se_sys_io_submit+0x171/0x304
      [  106.291864]    do_syscall_64+0x140/0x385
      [  106.292534]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Reported-by: default avatarMing Lei <ming.lei@redhat.com>
      Tested-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      58ab5e32
  6. 30 Nov, 2018 2 commits
    • Jens Axboe's avatar
      sbitmap: optimize wakeup check · 5d2ee712
      Jens Axboe authored
      Even if we have no waiters on any of the sbitmap_queue wait states, we
      still have to loop every entry to check. We do this for every IO, so
      the cost adds up.
      
      Shift a bit of the cost to the slow path, when we actually have waiters.
      Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
      an internal count of how many are currently active. Then we can simply
      check this count in sbq_wake_ptr() and not have to loop if we don't
      have any sleepers.
      
      Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5d2ee712
    • Jens Axboe's avatar
      sbitmap: ammortize cost of clearing bits · ea86ea2c
      Jens Axboe authored
      sbitmap maintains a set of words that we use to set and clear bits, with
      each bit representing a tag for blk-mq. Even though we spread the bits
      out and maintain a hint cache, one particular bit allocated will end up
      being cleared in the exact same spot.
      
      This introduces batched clearing of bits. Instead of clearing a given
      bit, the same bit is set in a cleared/free mask instead. If we fail
      allocating a bit from a given word, then we check the free mask, and
      batch move those cleared bits at that time. This trades 64 atomic bitops
      for 2 cmpxchg().
      
      In a threaded poll test case, half the overhead of getting and clearing
      tags is removed with this change. On another poll test case with a
      single thread, performance is unchanged.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ea86ea2c
  7. 29 Nov, 2018 1 commit
  8. 12 Jun, 2018 1 commit
    • Kees Cook's avatar
      treewide: kzalloc_node() -> kcalloc_node() · 590b5b7d
      Kees Cook authored
      The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
      patch replaces cases of:
      
              kzalloc_node(a * b, gfp, node)
      
      with:
              kcalloc_node(a * b, gfp, node)
      
      as well as handling cases of:
      
              kzalloc_node(a * b * c, gfp, node)
      
      with:
      
              kzalloc_node(array3_size(a, b, c), gfp, node)
      
      as it's slightly less ugly than:
      
              kcalloc_node(array_size(a, b), c, gfp, node)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc_node(4 * 1024, gfp, node)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc_node(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc_node(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc_node(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc_node
      + kcalloc_node
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc_node(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc_node(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc_node(C1 * C2 * C3, ...)
      |
        kzalloc_node(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc_node(sizeof(THING) * C2, ...)
      |
        kzalloc_node(sizeof(TYPE) * C2, ...)
      |
        kzalloc_node(C1 * C2 * C3, ...)
      |
        kzalloc_node(C1 * C2, ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      590b5b7d
  9. 24 May, 2018 1 commit
    • Ming Lei's avatar
      blk-mq: avoid starving tag allocation after allocating process migrates · e6fc4649
      Ming Lei authored
      When the allocation process is scheduled back and the mapped hw queue is
      changed, fake one extra wake up on previous queue for compensating wake
      up miss, so other allocations on the previous queue won't be starved.
      
      This patch fixes one request allocation hang issue, which can be
      triggered easily in case of very low nr_request.
      
      The race is as follows:
      
      1) 2 hw queues, nr_requests are 2, and wake_batch is one
      
      2) there are 3 waiters on hw queue 0
      
      3) two in-flight requests in hw queue 0 are completed, and only two
         waiters of 3 are waken up because of wake_batch, but both the two
         waiters can be scheduled to another CPU and cause to switch to hw
         queue 1
      
      4) then the 3rd waiter will wait for ever, since no in-flight request
         is in hw queue 0 any more.
      
      5) this patch fixes it by the fake wakeup when waiter is scheduled to
         another hw queue
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      
      Modified commit message to make it clearer, and make it apply on
      top of the 4.18 branch.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e6fc4649
  10. 14 May, 2018 1 commit
    • Jens Axboe's avatar
      sbitmap: fix race in wait batch accounting · c854ab57
      Jens Axboe authored
      If we have multiple callers of sbq_wake_up(), we can end up in a
      situation where the wait_cnt will continually go more and more
      negative. Consider the case where our wake batch is 1, hence
      wait_cnt will start out as 1.
      
      wait_cnt == 1
      
      CPU0				CPU1
      atomic_dec_return(), cnt == 0
      				atomic_dec_return(), cnt == -1
      				cmpxchg(-1, 0) (succeeds)
      				[wait_cnt now 0]
      cmpxchg(0, 1) (fails)
      
      This ends up with wait_cnt being 0, we'll wakeup immediately
      next time. Going through the same loop as above again, and
      we'll have wait_cnt -1.
      
      For the case where we have a larger wake batch, the only
      difference is that the starting point will be higher. We'll
      still end up with continually smaller batch wakeups, which
      defeats the purpose of the rolling wakeups.
      
      Always reset the wait_cnt to the batch value. Then it doesn't
      matter who wins the race. But ensure that whomever does win
      the race is the one that increments the ws index and wakes up
      our batch count, loser gets to call __sbq_wake_up() again to
      account his wakeups towards the next active wait state index.
      
      Fixes: 6c0ca7ae ("sbitmap: fix wakeup hang after sbq resize")
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c854ab57
  11. 10 May, 2018 2 commits
  12. 28 Feb, 2018 1 commit
    • Omar Sandoval's avatar
      sbitmap: use test_and_set_bit_lock()/clear_bit_unlock() · 4ace53f1
      Omar Sandoval authored
      sbitmap_queue_get()/sbitmap_queue_clear() are used for
      allocating/freeing a resource, so they should provide acquire/release
      barrier semantics, respectively. sbitmap_get() currently contains a full
      barrier, which is unnecessary, so use test_and_set_bit_lock() instead of
      test_and_set_bit() (these are equivalent on x86_64). sbitmap_clear_bit()
      does not imply any barriers, which is incorrect, as accesses of the
      resource (e.g., request) could potentially get reordered to after the
      clear_bit(). Introduce sbitmap_clear_bit_unlock() and use it for
      sbitmap_queue_clear() (this only adds a compiler barrier on x86_64). The
      other existing user of sbitmap_clear_bit() (the blk-mq software queue
      pending map) is serialized through a spinlock and does not need this.
      Reported-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4ace53f1
  13. 22 Dec, 2017 1 commit
    • Jens Axboe's avatar
      blk-mq: improve heavily contended tag case · 4e5dff41
      Jens Axboe authored
      Even with a number of waitqueues, we can get into a situation where we
      are heavily contended on the waitqueue lock. I got a report on spc1
      where we're spending seconds doing this. Arguably the use case is nasty,
      I reproduce it with one device and 1000 threads banging on the device.
      But that doesn't mean we shouldn't be handling it better.
      
      What ends up happening is that a thread will fail to get a tag, add
      itself to the waitqueue, and subsequently get woken up when a tag is
      freed - only to find itself going back to sleep on the waitqueue.
      
      Instead of waking all threads, use an exclusive wait and wake up our
      sbitmap batch count instead. This seems to work well for me (massive
      improvement for this use case), and it survives basic testing. But I
      haven't fully verified it yet.
      
      An additional improvement is running the queue and checking for a new
      tag BEFORE needing to add ourselves to the waitqueue.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4e5dff41
  14. 14 Apr, 2017 1 commit
    • Omar Sandoval's avatar
      sbitmap: add sbitmap_get_shallow() operation · c05e6673
      Omar Sandoval authored
      This operation supports the use case of limiting the number of bits that
      can be allocated for a given operation. Rather than setting aside some
      bits at the end of the bitmap, we can set aside bits in each word of the
      bitmap. This means we can keep the allocation hints spread out and
      support sbitmap_resize() nicely at the cost of lower granularity for the
      allowed depth.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c05e6673
  15. 02 Mar, 2017 1 commit
    • Ingo Molnar's avatar
      kasan, sched/headers: Uninline kasan_enable/disable_current() · af8601ad
      Ingo Molnar authored
      <linux/kasan.h> is a low level header that is included early
      in affected kernel headers. But it includes <linux/sched.h>
      which complicates the cleanup of sched.h dependencies.
      
      But kasan.h has almost no need for sched.h: its only use of
      scheduler functionality is in two inline functions which are
      not used very frequently - so uninline kasan_enable_current()
      and kasan_disable_current().
      
      Also add a <linux/sched.h> dependency to a .c file that depended
      on kasan.h including it.
      
      This paves the way to remove the <linux/sched.h> include from kasan.h.
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      af8601ad
  16. 27 Jan, 2017 1 commit
  17. 18 Jan, 2017 2 commits
    • Omar Sandoval's avatar
      sbitmap: fix wakeup hang after sbq resize · 6c0ca7ae
      Omar Sandoval authored
      When we resize a struct sbitmap_queue, we update the wakeup batch size,
      but we don't update the wait count in the struct sbq_wait_states. If we
      resized down from a size which could use a bigger batch size, these
      counts could be too large and cause us to miss necessary wakeups. To fix
      this, update the wait counts when we resize (ensuring some careful
      memory ordering so that it's safe w.r.t. concurrent clears).
      
      This also fixes a theoretical issue where two threads could end up
      bumping the wait count up by the batch size, which could also
      potentially lead to hangs.
      Reported-by: default avatarMartin Raiber <martin@urbackup.org>
      Fixes: e3a2b3f9 ("blk-mq: allow changing of queue depth through sysfs")
      Fixes: 2971c35f ("blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt")
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      6c0ca7ae
    • Omar Sandoval's avatar
      sbitmap: use smp_mb__after_atomic() in sbq_wake_up() · f66227de
      Omar Sandoval authored
      We always do an atomic clear_bit() right before we call sbq_wake_up(),
      so we can use smp_mb__after_atomic(). While we're here, comment the
      memory barriers in here a little more.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      f66227de
  18. 19 Sep, 2016 1 commit
  19. 17 Sep, 2016 7 commits