1. 17 Jan, 2019 1 commit
  2. 09 Jan, 2019 2 commits
  3. 21 Dec, 2018 4 commits
  4. 20 Dec, 2018 1 commit
  5. 19 Dec, 2018 2 commits
    • block: save irq state in blkg_lookup_create() · 3a762de5
      Ming Lei authored
      blkg_lookup_create() may be called from pool_map(), in which the irq
      state is saved, so blkg_lookup_create() has to save the irq state as well.
      
      Otherwise, the following lockdep warning can be triggered:
      
      [  104.258537] ================================
      [  104.259129] WARNING: inconsistent lock state
      [  104.259725] 4.20.0-rc6+ #545 Not tainted
      [  104.260268] --------------------------------
      [  104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [  104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
      [  104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.263747] {SOFTIRQ-ON-W} state was registered at:
      [  104.264417]   _raw_spin_unlock_irq+0x29/0x4c
      [  104.265014]   blkg_lookup_create+0xdc/0xe6
      [  104.265609]   bio_associate_blkg_from_css+0xd3/0x13f
      [  104.266312]   bio_associate_blkg+0x15a/0x1bb
      [  104.266913]   pool_map+0xe8/0x103 [dm_thin_pool]
      [  104.267572]   __map_bio+0x98/0x29c [dm_mod]
      [  104.268162]   __split_and_process_non_flush+0x29e/0x306 [dm_mod]
      [  104.269003]   __split_and_process_bio+0x16a/0x25b [dm_mod]
      [  104.269971]   __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
      [  104.270973]   generic_make_request+0x3f5/0x68b
      [  104.271676]   process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
      [  104.272531]   schedule_zero+0x239/0x273 [dm_thin_pool]
      [  104.273245]   process_cell+0x60c/0x6f1 [dm_thin_pool]
      [  104.273967]   do_worker+0x60c/0xca8 [dm_thin_pool]
      [  104.274635]   process_one_work+0x4eb/0x834
      [  104.275203]   worker_thread+0x318/0x484
      [  104.275740]   kthread+0x1d1/0x1e1
      [  104.276203]   ret_from_fork+0x3a/0x50
      [  104.276714] irq event stamp: 170003
      [  104.277201] hardirqs last  enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b
      [  104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55
      [  104.280273] softirqs last  enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73
      [  104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d
      [  104.282744]
      [  104.282744] other info that might help us debug this:
      [  104.283640]  Possible unsafe locking scenario:
      [  104.283640]
      [  104.284452]        CPU0
      [  104.284803]        ----
      [  104.285150]   lock(&(&pool->lock)->rlock#3);
      [  104.285762]   <Interrupt>
      [  104.286130]     lock(&(&pool->lock)->rlock#3);
      [  104.286750]
      [  104.286750]  *** DEADLOCK ***
      [  104.286750]
      [  104.287564] no locks held by swapper/49/0.
      [  104.288129]
      [  104.288129] stack backtrace:
      [  104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
      [  104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
      [  104.290858] Call Trace:
      [  104.291204]  <IRQ>
      [  104.291502]  dump_stack+0x9a/0xe6
      [  104.291968]  mark_lock+0x56c/0x7a6
      [  104.292442]  ? check_usage_backwards+0x209/0x209
      [  104.293086]  __lock_acquire+0x400/0x15bf
      [  104.293662]  ? check_chain_key+0x150/0x1aa
      [  104.294236]  lock_acquire+0x1a6/0x1e3
      [  104.294768]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.295444]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
      [  104.296143]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
      [  104.297031]  _raw_spin_lock_irqsave+0x46/0x55
      [  104.297659]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.298335]  thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.298997]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
      [  104.299886]  ? check_flags+0x20a/0x20a
      [  104.300408]  ? lock_acquire+0x1a6/0x1e3
      [  104.300954]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
      [  104.301865]  clone_endio+0x1bb/0x22d [dm_mod]
      [  104.302491]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
      [  104.303200]  ? bio_disassociate_blkg+0xc6/0x15f
      [  104.303836]  ? bio_endio+0x2b2/0x2da
      [  104.304349]  clone_endio+0x1f3/0x22d [dm_mod]
      [  104.304978]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
      [  104.305709]  ? bio_disassociate_blkg+0xc6/0x15f
      [  104.306333]  ? bio_endio+0x2b2/0x2da
      [  104.306853]  clone_endio+0x1f3/0x22d [dm_mod]
      [  104.307476]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
      [  104.308185]  ? bio_disassociate_blkg+0xc6/0x15f
      [  104.308817]  ? bio_endio+0x2b2/0x2da
      [  104.309319]  blk_update_request+0x2de/0x4cc
      [  104.309927]  blk_mq_end_request+0x2a/0x183
      [  104.310498]  blk_done_softirq+0x16a/0x1a6
      [  104.311051]  ? blk_softirq_cpu_dead+0xe2/0xe2
      [  104.311653]  ? __lock_is_held+0x2a/0x87
      [  104.312186]  __do_softirq+0x250/0x4e8
      [  104.312705]  irq_exit+0x7e/0x11d
      [  104.313157]  call_function_single_interrupt+0xf/0x20
      [  104.313860]  </IRQ>
      [  104.314163] RIP: 0010:native_safe_halt+0x2/0x3
      [  104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
      [  104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
      [  104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
      [  104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
      [  104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
      [  104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
      [  104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
      [  104.323307]  ? lockdep_hardirqs_on+0x26b/0x278
      [  104.323927]  default_idle+0xd9/0x1a8
      [  104.324427]  do_idle+0x162/0x2b2
      [  104.324891]  ? arch_cpu_idle_exit+0x28/0x28
      [  104.325467]  ? mark_held_locks+0x28/0x7f
      [  104.326031]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
      [  104.326719]  cpu_startup_entry+0x1d/0x1f
      [  104.327261]  start_secondary+0x2cb/0x308
      [  104.327806]  ? set_cpu_sibling_map+0x8a3/0x8a3
      [  104.328421]  secondary_startup_64+0xa4/0xb0
      
      Fixes: b978962a ("blkcg: update blkg_lookup_create() to do locking")
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
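The difference between the unconditional and the state-saving lock variants can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: the `Cpu` class and all function names below are hypothetical stand-ins, and a single boolean models the local interrupt-enable state. It shows why a helper that unconditionally re-enables interrupts (as `_raw_spin_unlock_irq` does in the trace above) breaks a caller that had interrupts disabled.

```python
class Cpu:
    """Toy model of one CPU's local interrupt-enable flag (hypothetical)."""

    def __init__(self):
        self.irqs_enabled = True

    def local_irq_save(self):
        # Remember the previous state, then disable interrupts.
        flags = self.irqs_enabled
        self.irqs_enabled = False
        return flags

    def local_irq_restore(self, flags):
        # Put back whatever state the caller saved.
        self.irqs_enabled = flags


def lookup_unconditional(cpu):
    """Models a lock_irq/unlock_irq pair: always re-enables on exit."""
    cpu.irqs_enabled = False  # lock: disable interrupts
    cpu.irqs_enabled = True   # unlock: enable, ignoring the caller's state


def lookup_saving(cpu):
    """Models a lock_irqsave/unlock_irqrestore pair."""
    flags = cpu.local_irq_save()
    cpu.local_irq_restore(flags)


def caller_with_irqs_off(lookup):
    """A caller that disabled interrupts itself before calling the helper.

    Returns True if interrupts are still disabled after the helper returns.
    """
    cpu = Cpu()
    outer = cpu.local_irq_save()
    lookup(cpu)
    still_disabled = not cpu.irqs_enabled
    cpu.local_irq_restore(outer)
    return still_disabled
```

In this model, `caller_with_irqs_off(lookup_unconditional)` reports that interrupts were turned back on behind the caller's back, while `caller_with_irqs_off(lookup_saving)` leaves the caller's state intact.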
    • scsi: block: remove the cluster flag · 38417468
      Christoph Hellwig authored
      Now that the SCSI layer has replaced the use of the cluster flag with
      segment size limits and the DMA boundary, we can remove the cluster flag
      from the block layer.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
  6. 18 Dec, 2018 3 commits
  7. 17 Dec, 2018 9 commits
    • blk-mq: skip zero-queue maps in blk_mq_map_swqueue · e5edd5f2
      Ming Lei authored
      Since 7e849dd9 ("nvme-pci: don't share queue maps"), the mapping
      table is not initialized if map->nr_queues is zero, so we can no
      longer use blk_mq_map_queue_type() to retrieve the hctx.
      
      This can still result in a broken mapping; fix it by skipping
      zero-queue maps in blk_mq_map_swqueue().
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix blk-iolatency accounting underflow · 13369816
      Dennis Zhou authored
      The blk-iolatency controller measures the time from rq_qos_throttle() to
      rq_qos_done_bio() and attributes this time to the first bio that needs
      to create the request. This means if a bio is plug-mergeable or
      bio-mergeable, it gets to bypass the blk-iolatency controller.
      
      The recent series [1] to tag all bios with blkgs undermined how
      iolatency determined which bios it was charging and should process in
      rq_qos_done_bio(). Because all bios are now tagged, the atomic_t for
      the struct rq_wait inflight count underflowed, resulting in a stall.
      
      This patch adds a new flag BIO_TRACKED to let controllers know that a
      bio is going through the rq_qos path. blk-iolatency now checks if this
      flag is set to see if it should process the bio in rq_qos_done_bio().
      
      Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing.
      BIO_THROTTLED was another candidate, but the flag is set for all bios
      that have gone through blk-throttle code. Overloading a flag comes with
      the burden of making sure that when either implementation changes, a
      change in setting rules for one doesn't cause a bug in the other. So
      here, we unfortunately opt for adding a new flag.
      
      [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
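The underflow described in this commit can be reproduced with a small accounting model. This is a hedged sketch, not the kernel implementation: `RqWait`, `Bio`, `throttle`, `done_bio`, and the `has_blkg` and `tracked` fields are hypothetical names, with `tracked` standing in for the new BIO_TRACKED flag. The model assumes a merged bio skips the throttle step but still reaches the completion step.

```python
class RqWait:
    """Holds the inflight counter (an atomic_t in the kernel; an int here)."""

    def __init__(self):
        self.inflight = 0


class Bio:
    def __init__(self):
        self.has_blkg = True   # after the series, every bio carries a blkg
        self.tracked = False   # stands in for BIO_TRACKED


def throttle(rqw, bio):
    """Charge the bio on the way in and mark it as tracked."""
    rqw.inflight += 1
    bio.tracked = True


def done_bio(rqw, bio, check_tracked):
    """Uncharge the bio on completion.

    With check_tracked=False, only the blkg tag decides whether the bio is
    processed, which models the behavior that underflowed.
    """
    if check_tracked and not bio.tracked:
        return
    if not bio.has_blkg:
        return
    rqw.inflight -= 1


def run(check_tracked):
    """One throttled bio plus one merged bio that bypassed throttle()."""
    rqw = RqWait()
    first, merged = Bio(), Bio()
    throttle(rqw, first)
    done_bio(rqw, first, check_tracked)
    done_bio(rqw, merged, check_tracked)
    return rqw.inflight
```

Here `run(False)` ends with a negative counter, while `run(True)` ends balanced at zero because the merged bio is skipped at completion.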
    • blk-mq: fix dispatch from sw queue · c16d6b5a
      Ming Lei authored
      When a request is added to the rq list of a sw queue (ctx), the rq may
      come from a different type of hctx, especially now that multiple queue
      mapping has been introduced.
      
      So when dispatching requests from a sw queue via blk_mq_flush_busy_ctxs()
      or blk_mq_dequeue_from_ctx(), a request belonging to another queue type
      of hctx can be dispatched to the current hctx if the read queue or poll
      queue is enabled.
      
      This patch fixes the issue by introducing a per-queue-type list.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
      Changed by me to not use separately cacheline aligned lists, just
      place them all in the same cacheline where we had just the one list
      and lock before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
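The effect of moving from one list per sw queue to one list per queue type can be sketched as follows. This is an illustrative model only: the class names, the `Request` tuple, and the type names in `TYPES` are assumptions made for the example, not the kernel's data structures.

```python
from collections import namedtuple

# A request remembers which type of hw queue it was allocated for.
Request = namedtuple("Request", ["tag", "hctx_type"])

# Hypothetical queue type names used only in this sketch.
TYPES = ("default", "read", "poll")


class CtxSingleList:
    """Sw queue with one shared list: a flush hands over everything."""

    def __init__(self):
        self.rq_list = []

    def add(self, rq):
        self.rq_list.append(rq)

    def flush(self, hctx_type):
        # The requested type is ignored, so foreign requests leak through.
        out, self.rq_list = self.rq_list, []
        return out


class CtxPerTypeLists:
    """Sw queue with one list per queue type."""

    def __init__(self):
        self.rq_lists = {t: [] for t in TYPES}

    def add(self, rq):
        self.rq_lists[rq.hctx_type].append(rq)

    def flush(self, hctx_type):
        # Only requests that belong to the flushing hctx type are returned.
        out, self.rq_lists[hctx_type] = self.rq_lists[hctx_type], []
        return out
```

With a default request and a poll request queued on the same sw queue, flushing for the default type returns both requests from `CtxSingleList` but only the default one from `CtxPerTypeLists`.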
    • block: mq-deadline: Fix write completion handling · 7211aef8
      Damien Le Moal authored
      For a zoned block device using mq-deadline, if a write request for a
      zone is received while another write was already dispatched for the same
      zone, dd_dispatch_request() will return NULL and the newly inserted
      write request is kept in the scheduler queue waiting for the ongoing
      zone write to complete. With this behavior, when no other request has
      been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
      and blk_mq_sched_mark_restart_hctx() is not called. As a result, the
      blk_mq_sched_restart() call in __blk_mq_free_request() does not run the
      queue when the already dispatched write request completes. The newly
      inserted request stays stuck in the scheduler queue until another
      request is eventually submitted.
      
      This problem does not affect SCSI disks, as the SCSI stack handles queue
      restart on request completion. However, this problem can be triggered
      with the nullblk driver with zoned mode enabled.
      
      Fix this by always requesting a queue restart in dd_dispatch_request()
      if no request was dispatched while WRITE requests are queued.
      
      Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      
      Add missing export of blk_mq_sched_restart()
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: only dispatch to non-default queue maps if they have queues · 5aceaeb2
      Christoph Hellwig authored
      We should check if a given queue map actually has queues enabled before
      dispatching to it.  This allows drivers to leave optional but unused map
      types uninitialized, which will subsequently allow fixing problems with
      queue map rebuilds for that case.
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: export hctx->type in debugfs instead of sysfs · 346fc108
      Ming Lei authored
      Currently hctx->type is exported only via sysfs, and the hctx entry
      under debugfs carries no such info. We often rely on debugfs alone to
      diagnose queue mapping issues, so add the support to debugfs.
      
      Queue mapping has become a bit more complicated now that multiple queue
      mapping is supported; we may write a blktest to verify that queue
      mapping is valid based on blk-mq-debugfs.
      
      Since it is not necessary to export hctx->type twice, remove the export
      from sysfs.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: fix allocation for queue mapping table · 07b35eb5
      Ming Lei authored
      The type of each element in the queue mapping table is 'unsigned int',
      instead of 'struct blk_mq_queue_map', so fix it.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-wbt: export internal state via debugfs · d19afebc
      Ming Lei authored
      This information is helpful to either investigate issues, or understand
      wbt's internal behaviour.
      
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq-debugfs: support rq_qos · cc56694f
      Ming Lei authored
      blk-mq-debugfs has proved very helpful for debugging some tough
      issues, such as IO hangs.
      
      We have seen blk-wbt related IO hangs several times; there is even
      such a report in Red Hat BZ that is not solved yet, so this patch
      adds debugfs support for rq_qos.
      
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 16 Dec, 2018 6 commits
  9. 14 Dec, 2018 4 commits
  10. 13 Dec, 2018 2 commits
  11. 12 Dec, 2018 1 commit
    • block: deactivate blk_stat timer in wbt_disable_default() · 544fbd16
      Ming Lei authored
      rwb_enabled() can't be changed when there is any inflight IO.
      
      wbt_disable_default() may set rwb->wb_normal to zero; however, the
      blk_stat timer may still be pending, and the timer function will update
      rwb->wb_normal again.
      
      This patch introduces blk_stat_deactivate() and applies it in
      wbt_disable_default(), which fixes the following IO hang triggered when
      running parted and switching the io scheduler:
      
      [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
      [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
      [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  369.940768] parted          D    0  3645   3239 0x00000000
      [  369.941500] Call Trace:
      [  369.941874]  ? __schedule+0x6d9/0x74c
      [  369.942392]  ? wbt_done+0x5e/0x5e
      [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
      [  369.943404]  ? wbt_done+0x5e/0x5e
      [  369.943874]  schedule+0x67/0x78
      [  369.944298]  io_schedule+0x12/0x33
      [  369.944771]  rq_qos_wait+0xb5/0x119
      [  369.945193]  ? karma_partition+0x1c2/0x1c2
      [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
      [  369.946151]  wbt_wait+0x85/0xb6
      [  369.946540]  __rq_qos_throttle+0x23/0x2f
      [  369.947014]  blk_mq_make_request+0xe6/0x40a
      [  369.947518]  generic_make_request+0x192/0x2fe
      [  369.948042]  ? submit_bio+0x103/0x11f
      [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
      [  369.949011]  submit_bio+0x103/0x11f
      [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
      [  369.949962]  submit_bio_wait+0x53/0x7f
      [  369.950469]  blkdev_issue_flush+0x8a/0xae
      [  369.951032]  blkdev_fsync+0x2f/0x3a
      [  369.951502]  do_fsync+0x2e/0x47
      [  369.951887]  __x64_sys_fsync+0x10/0x13
      [  369.952374]  do_syscall_64+0x89/0x149
      [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  369.953492] RIP: 0033:0x7f95a1e729d4
      [  369.953996] Code: Bad RIP value.
      [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
      [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
      [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
      [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
      [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
      
      Cc: stable@vger.kernel.org
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
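The race between disabling throttling and a still-pending timer can be modelled in a few lines. This is a rough sketch under stated assumptions, not the kernel code: the `Wbt` class, its methods, and the limit value 8 are hypothetical, and the timer is reduced to a pending flag that the test fires by hand.

```python
class Wbt:
    """Toy writeback-throttling state with a one-shot stats timer."""

    NORMAL_LIMIT = 8  # arbitrary value chosen for this sketch

    def __init__(self):
        self.wb_normal = self.NORMAL_LIMIT
        self.timer_pending = False

    def arm_timer(self):
        self.timer_pending = True

    def fire_timer(self):
        # The timer callback recomputes the limit if it is still pending.
        if self.timer_pending:
            self.timer_pending = False
            self.wb_normal = self.NORMAL_LIMIT

    def disable_default(self, deactivate_timer):
        # Setting the limit to zero marks throttling as disabled.
        self.wb_normal = 0
        if deactivate_timer:
            # Models cancelling the pending timer during the disable step.
            self.timer_pending = False


def run(deactivate_timer):
    """Disable throttling while a timer is pending, then let the timer fire."""
    wbt = Wbt()
    wbt.arm_timer()
    wbt.disable_default(deactivate_timer)
    wbt.fire_timer()
    return wbt.wb_normal
```

In this model, `run(False)` ends with a non-zero limit because the stale timer re-enabled throttling, whereas `run(True)` keeps the limit at zero.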
  12. 11 Dec, 2018 1 commit
  13. 10 Dec, 2018 4 commits