1. 12 Feb, 2019 1 commit
    • blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue · aef1897c
      Jianchao Wang authored
      When requeuing, if RQF_DONTPREP is set, the rq already contains
      driver-specific data, so insert it into the hctx dispatch list to
      avoid any merge. Take SCSI as an example; here is the trace event
      log (no io scheduler, because RQF_STARTED would prevent merging):
      
         kworker/0:1H-339   [000] ...1  2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
      scsi_inert_test-1987  [000] ....  2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
      scsi_inert_test-1987  [000] ...2  2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
         kworker/0:1H-339   [000] ....  2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
      scsi_inert_test-1996  [000] ..s1  2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
      scsi_inert_test-1996  [000] .Ns1  2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
         kworker/0:1H-339   [000] ...1  2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
         kworker/0:1H-339   [000] ...1  2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
      scsi_inert_test-1986  [000] ..s1  2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
      
      (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
      Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
      the sdb only contained the part of (32768 + 8), then only that part
      was completed. The lucky thing was that scsi_io_completion detected
      it and requeued the remaining part. So we didn't get corrupted data.
      However, the requeue of (32776 + 8) is not expected.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 16 Jan, 2019 1 commit
  3. 18 Dec, 2018 3 commits
  4. 17 Dec, 2018 3 commits
    • blk-mq: skip zero-queue maps in blk_mq_map_swqueue · e5edd5f2
      Ming Lei authored
      Since commit 7e849dd9 ("nvme-pci: don't share queue maps"), the
      mapping table isn't actually initialized when map->nr_queues is
      zero, so we can't use blk_mq_map_queue_type() to retrieve the hctx
      any more.
      
      This can still leave a broken mapping; fix it by skipping
      zero-queue maps in blk_mq_map_swqueue().
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: fix dispatch from sw queue · c16d6b5a
      Ming Lei authored
      When a request is added to the rq list of a sw queue (ctx), the rq
      may belong to a different type of hctx, especially after
      multi-queue mapping was introduced.
      
      So when dispatching requests from a sw queue via
      blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), a request
      belonging to another hctx queue type can be dispatched to the
      current hctx when the read or poll queue is enabled.
      
      This patch fixes this issue by introducing per-queue-type list.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
      Changed by me to not use separately cacheline aligned lists, just
      place them all in the same cacheline where we had just the one list
      and lock before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: fix allocation for queue mapping table · 07b35eb5
      Ming Lei authored
      The type of each element in the queue mapping table is
      'unsigned int', not 'struct blk_mq_queue_map', so fix the
      allocation size accordingly.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 16 Dec, 2018 3 commits
  6. 10 Dec, 2018 1 commit
  7. 08 Dec, 2018 1 commit
    • blk-mq: re-build queue map in case of kdump kernel · 59388702
      Ming Lei authored
      Almost all .map_queues() implementations based on managed irq
      affinity don't update the queue mapping; they just retrieve the
      previously built mapping, so if nr_hw_queues changes, the mapping
      table contains stale entries. Only blk_mq_map_queues() may rebuild
      the mapping table.
      
      One such case is limiting .nr_hw_queues to 1 in a kdump kernel.
      Drivers often build the queue mapping before allocating the tagset
      via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be
      forced to 1 in a kdump kernel, so a wrong queue mapping is used and
      a kernel panic[1] is observed during boot.
      
      This patch fixes the kernel panic triggered on nvme by rebuilding
      the mapping table via blk_mq_map_queues().
      
      [1] kernel panic log
      [    4.438371] nvme nvme0: 16/0/0 default/read/poll queues
      [    4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      [    4.444681] PGD 0 P4D 0
      [    4.445367] Oops: 0000 [#1] SMP NOPTI
      [    4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
      [    4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
      [    4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [    4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
      [    4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
      [    4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
      [    4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
      [    4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
      [    4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
      [    4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
      [    4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
      [    4.469220] FS:  0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
      [    4.471554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
      [    4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    4.477061] PKRU: 55555554
      [    4.477464] Call Trace:
      [    4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
      [    4.479595]  blk_mq_init_queue+0x32/0x4e
      [    4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
      [    4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
      [    4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
      [    4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
      [    4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
      [    4.483930]  ? try_to_wake_up+0x38d/0x3b3
      [    4.484478]  ? process_one_work+0x179/0x2fc
      [    4.485118]  process_one_work+0x1d3/0x2fc
      [    4.485655]  ? rescuer_thread+0x2ae/0x2ae
      [    4.486196]  worker_thread+0x1e9/0x2be
      [    4.486841]  kthread+0x115/0x11d
      [    4.487294]  ? kthread_park+0x76/0x76
      [    4.487784]  ret_from_fork+0x3a/0x50
      [    4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
      [    4.489428] Dumping ftrace buffer:
      [    4.489939]    (ftrace buffer empty)
      [    4.490492] CR2: 0000000000000098
      [    4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: linux-nvme@lists.infradead.org
      Cc: David Milburn <dmilburn@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 07 Dec, 2018 1 commit
    • blk-mq: punt failed direct issue to dispatch list · c616cbee
      Jens Axboe authored
      After the direct dispatch corruption fix, we permanently disallow direct
      dispatch of non read/write requests. This works fine off the normal IO
      path, as they will be retried like any other failed direct dispatch
      request. But for the blk_insert_cloned_request() that only DM uses to
      bypass the bottom level scheduler, we always first attempt direct
      dispatch. For some types of requests, that's now a permanent failure,
      and no amount of retrying will make that succeed. This results in a
      livelock.
      
      Instead of making special cases for what we can direct issue, and now
      having to deal with DM solving the livelock while still retaining a BUSY
      condition feedback loop, always just add a request that has been through
      ->queue_rq() to the hardware queue dispatch list. These are safe to use
      as no merging can take place there. Additionally, if requests do have
      prepped data from drivers, we aren't dependent on them not sharing space
      in the request structure to safely add them to the IO scheduler lists.
      
      This basically reverts ffe81d45 and is based on a patch from Ming,
      but with the list insert case covered as well.
      
      Fixes: ffe81d45 ("blk-mq: fix corruption with direct issue")
      Cc: stable@vger.kernel.org
      Suggested-by: Ming Lei <ming.lei@redhat.com>
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 05 Dec, 2018 1 commit
    • blk-mq: fix corruption with direct issue · ffe81d45
      Jens Axboe authored
      If we attempt a direct issue to a SCSI device, and it returns BUSY, then
      we queue the request up normally. However, the SCSI layer may have
      already setup SG tables etc for this particular command. If we later
      merge with this request, then the old tables are no longer valid. Once
      we issue the IO, we only read/write the original part of the request,
      not the new state of it.
      
      This causes data corruption, and is most often noticed with the file
      system complaining about the just read data being invalid:
      
      [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
      
      because most of it is garbage...
      
      This doesn't happen from the normal issue path, as we will simply defer
      the request to the hardware queue dispatch list if we fail. Once it's on
      the dispatch list, we never merge with it.
      
      Fix this from the direct issue path by flagging the request as
      REQ_NOMERGE so we don't change the size of it before issue.
      
      See also:
        https://bugzilla.kernel.org/show_bug.cgi?id=201685
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 6ce3dd6e ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 04 Dec, 2018 2 commits
  11. 03 Dec, 2018 1 commit
  12. 29 Nov, 2018 5 commits
  13. 28 Nov, 2018 1 commit
  14. 26 Nov, 2018 8 commits
  15. 21 Nov, 2018 1 commit
  16. 20 Nov, 2018 1 commit
  17. 19 Nov, 2018 2 commits
  18. 16 Nov, 2018 1 commit
  19. 15 Nov, 2018 3 commits