1. 08 Mar, 2018 2 commits
    • nvme_fc: rework sqsize handling · d157e534
      James Smart authored
      Corrected four outstanding issues in the transport around sqsize:
      1: Create Connection LS is sending the 1's-based sqsize; it should be
      sending the 0's-based value.
      2: Allocation of the hw queue is using the 0's-based size; it should be
      using the 1's-based value.
      3: Normalization of ctrl->sqsize by MQES is using MQES+1 (1's-based
      value); it should use MQES (0's-based value).
      4: Missing clause to ensure queue_count is not larger than ctrl->sqsize.
      Corrected by:
      Clean up routines that pass queue size around. The queue size value is
      the actual count (1's-based) value, determined from ctrl->sqsize + 1.
      Routines that send the 0's-based value adapt from queue size.
      Set ctrl->sqsize properly for MQES.
      Added clause to ensure queue_count is not larger than ctrl->sqsize + 1.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
    • nvme-fabrics: Ignore nr_io_queues option for discovery controllers · 0475821e
      Roland Dreier authored
      This removes a dependency on the order in which options are passed when
      creating a fabrics controller.  With the old code, if "nr_io_queues" appears before
      an "nqn" option specifying the discovery controller, then nr_io_queues
      is overridden with zero.  If "nr_io_queues" appears after specifying the
      discovery controller, then the nr_io_queues option is used to set the
      number of queues, and the driver attempts to establish IO connections
      to the discovery controller (which doesn't work).
      It seems better to ignore (and warn about) the "nr_io_queues" option
      if userspace has already asked to connect to the discovery controller.
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  2. 07 Mar, 2018 1 commit
  3. 01 Mar, 2018 2 commits
    • nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors · 16ccfff2
      Ming Lei authored
      Commit 84676c1f ("genirq/affinity: assign vectors to all possible CPUs")
      switched to spreading irq vectors among all possible CPUs, so
      pass num_possible_cpus() as the max vectors to be assigned.
      For example, in an 8-core system with CPUs 0~3 online and 4~7 offline/not present,
      see 'lscpu':
              Architecture:          x86_64
              CPU op-mode(s):        32-bit, 64-bit
              Byte Order:            Little Endian
              CPU(s):                4
              On-line CPU(s) list:   0-3
              Thread(s) per core:    1
              Core(s) per socket:    2
              Socket(s):             2
              NUMA node(s):          2
              NUMA node0 CPU(s):     0-3
              NUMA node1 CPU(s):
      1) before this patch, follows the allocated vectors and their affinity:
      	irq 47, cpu list 0,4
      	irq 48, cpu list 1,6
      	irq 49, cpu list 2,5
      	irq 50, cpu list 3,7
      2) after this patch, follows the allocated vectors and their affinity:
      	irq 43, cpu list 0
      	irq 44, cpu list 1
      	irq 45, cpu list 2
      	irq 46, cpu list 3
      	irq 47, cpu list 4
      	irq 48, cpu list 6
      	irq 49, cpu list 5
      	irq 50, cpu list 7
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
    • nvme-pci: Fix EEH failure on ppc · 651438bb
      Wen Xiong authored
      Triggering PPC EEH detection and handling requires a memory mapped read
      failure. The NVMe driver removed the periodic health check MMIO, so
      there's no early detection mechanism to trigger the recovery. Instead,
      the detection now happens when the nvme driver handles an IO timeout
      event. This takes the pci channel offline, so we do not want the driver
      to proceed with escalating its own recovery efforts that may conflict
      with the EEH handler.
      This patch ensures the driver will observe the channel was set to offline
      after a failed MMIO read and resets the IO timer so the EEH handler has
      a chance to recover the device.
      Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      [updated change log]
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  4. 28 Feb, 2018 2 commits
    • nvmet: fix PSDT field check in command format · bffd2b61
      Max Gurtovoy authored
      PSDT field section according to NVM_Express-1.3:
      "This field specifies whether PRPs or SGLs are used for any data
      transfer associated with the command. PRPs shall be used for all
      Admin commands for NVMe over PCIe. SGLs shall be used for all Admin
      and I/O commands for NVMe over Fabrics. This field shall be set to
      01b for NVMe over Fabrics 1.0 implementations."
      Suggested-by: Idan Burstein <idanb@mellanox.com>
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
    • nvme-multipath: fix sysfs dangerously created links · 9bd82b1a
      Baegjae Sung authored
      If multipathing is enabled, each NVMe subsystem creates a head
      namespace (e.g., nvme0n1) and multiple private namespaces
      (e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
      private namespaces, links of head namespace are used, so the
      namespace creation order must be followed (e.g., nvme0n1 ->
      nvme0c1n1). If the order is not followed, links of sysfs will be
      incomplete or kernel panic will occur.
      The kernel panic was:
        kernel BUG at fs/sysfs/symlink.c:27!
        Call Trace:
          nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
          nvme_validate_ns+0x5c2/0x850 [nvme_core]
          nvme_scan_work+0x1af/0x2d0 [nvme_core]
      Correct order:
        Context A     Context B
        nvme0c0n1     nvme0c1n1
      Incorrect order:
        Context A     Context B
      The nvme_mpath_add_disk (for creating head namespace) is called
      just before the nvme_mpath_add_disk_links (for creating private
      namespaces). In nvme_mpath_add_disk, the first context acquires
      the lock of subsystem and creates a head namespace, and other
      contexts do nothing by checking GENHD_FL_UP of a head namespace
      after waiting to acquire the lock. We verified the code with or
      without multipathing using three vendors of dual-port NVMe SSDs.
      Signed-off-by: Baegjae Sung <baegjae@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  5. 26 Feb, 2018 1 commit
  6. 22 Feb, 2018 3 commits
  7. 14 Feb, 2018 5 commits
  8. 12 Feb, 2018 1 commit
    • nvme: Don't use a stack buffer for keep-alive command · 0a34e466
      Roland Dreier authored
      In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
      the stack into blk_execute_rq_nowait().  However, the block layer doesn't
      guarantee that the request is fully queued before blk_execute_rq_nowait()
      returns.  If not, and the request is queued after nvme_keep_alive() returns,
      then we'll end up using stack memory that might have been overwritten to
      form the NVMe command we pass to hardware.
      Fix this by keeping a special command struct in the nvme_ctrl struct right
      next to the delayed work struct used for keep-alives.
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  9. 11 Feb, 2018 2 commits
    • nvme_fc: cleanup io completion · c3aedd22
      James Smart authored
      There was some old code that dealt with complete_rq being called
      prior to the lldd returning the io completion. This is garbage code.
      The complete_rq routine was being called after eh_timeouts were
      called and it was due to eh_timeouts not being handled properly.
      The timeouts were fixed in prior patches so that in general, a
      timeout will initiate an abort and the reset timer restarted as
      the abort operation will take care of completing things. Given the
      reset timer restarted, the erroneous complete_rq calls were eliminated.
      So remove the work that was synchronizing complete_rq with io completion.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme_fc: correct abort race condition on resets · 3efd6e8e
      James Smart authored
      During reset handling, there is live io completing while the reset
      is taking place. The reset path attempts to abort all outstanding io,
      counting the number of ios that were reset. It then waits for those
      ios to be reclaimed from the lldd before continuing.
      The transport's logic on io state and flag setting was poor, allowing
      ios to complete simultaneous to the abort request. The completed ios
      were counted, but as the completion had already occurred, the
      completion never reduced the count. As the count never zeros, the
      reset/delete never completes.
      Tighten it up by unconditionally changing the op state to completed
      when the io done handler is called.  The reset/abort path now changes
      the op state to aborted, but the abort only continues if the op
      state was live previously. If complete, the abort is backed out.
      Thus proper counting of io aborts and their completions is working.
      Also removed the TERMIO state on the op as it's redundant with the
      op's aborted state.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  10. 08 Feb, 2018 4 commits
  11. 31 Jan, 2018 1 commit
    • blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Ming Lei authored
      This status is returned from driver to block layer if device related
      resource is unavailable, but driver can guarantee that IO dispatch
      will be triggered in future when the resource is available.
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
      returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
      a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
      3 ms because both scsi-mq and nvmefc are using that magic value.
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      3) if SCHED_RESTART is set concurrently in context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure an IO hang can be avoided.
      One invariant is that queue will be rerun if SCHED_RESTART is set.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 26 Jan, 2018 3 commits
  13. 25 Jan, 2018 3 commits
  14. 23 Jan, 2018 1 commit
  15. 17 Jan, 2018 6 commits
  16. 15 Jan, 2018 3 commits