1. 21 Oct, 2014 1 commit
    • block: remove artifical max_hw_sectors cap · 34b48db6
      Christoph Hellwig authored
      
      
      Set max_sectors to the value the driver provides as the hardware limit by
      default.  Linux has had proper I/O throttling for a long time and doesn't
      rely on an artificially small maximum I/O size anymore.  By not limiting
      the I/O size by default we remove an annoying tuning step required for
      most Linux installations.
      
      Note that both the user and, if absolutely required, the driver can still
      impose a limit for FS requests below max_hw_sectors_kb.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      34b48db6
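The change described in 34b48db6 can be illustrated with a small userspace model. This is only a sketch: the struct, the function names, and the value of `DEF_MAX_SECTORS` are assumptions made for illustration and are not taken from the patch itself.

```c
/* Hypothetical default cap, in 512-byte sectors (an assumed value). */
#define DEF_MAX_SECTORS 1024u

/* Hypothetical stand-in for the relevant queue limits. */
struct limits_sketch {
    unsigned int max_hw_sectors; /* hard limit reported by the driver */
    unsigned int max_sectors;    /* soft limit applied to FS requests */
};

/* Old behaviour: the soft limit is clamped to a small default. */
static void set_limits_capped(struct limits_sketch *l, unsigned int hw)
{
    l->max_hw_sectors = hw;
    l->max_sectors = hw < DEF_MAX_SECTORS ? hw : DEF_MAX_SECTORS;
}

/* New behaviour: the soft limit starts out equal to the hardware limit. */
static void set_limits_uncapped(struct limits_sketch *l, unsigned int hw)
{
    l->max_hw_sectors = hw;
    l->max_sectors = hw;
}

/*
 * A user-chosen limit must stay at or below the hardware limit.
 * Returns 0 on success, -1 if the value is rejected.
 */
static int set_user_limit(struct limits_sketch *l, unsigned int sectors)
{
    if (sectors == 0 || sectors > l->max_hw_sectors)
        return -1;
    l->max_sectors = sectors;
    return 0;
}
```

With a device that reports 8192 sectors, the capped variant leaves `max_sectors` at the assumed default, while the uncapped variant uses the full 8192 until the user lowers it.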
  2. 10 Oct, 2014 1 commit
  3. 09 Oct, 2014 1 commit
  4. 27 Sep, 2014 8 commits
  5. 25 Sep, 2014 1 commit
  6. 08 Sep, 2014 1 commit
    • block, bdi: an active gendisk always has a request_queue associated with it · ff9ea323
      Tejun Heo authored
      
      
      bdev_get_queue() returns the request_queue associated with the
      specified block_device.  blk_get_backing_dev_info() makes use of
      bdev_get_queue() to determine the associated bdi given a block_device.
      
      All the callers of bdev_get_queue() including
      blk_get_backing_dev_info() assume that bdev_get_queue() may return
      NULL and implement NULL handling; however, bdev_get_queue() requires that
      the passed-in block_device be opened and attached to its gendisk.
      Because an active gendisk always has a valid request_queue associated
      with it, bdev_get_queue() can never return NULL and neither can
      blk_get_backing_dev_info().
      
      Make it clear that neither of the two functions can return NULL and
      remove NULL handling from all the callers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ff9ea323
  7. 01 Jul, 2014 2 commits
    • blk-mq: use percpu_ref for mq usage count · add703fd
      Tejun Heo authored
      
      
      Currently, blk-mq uses a percpu_counter to keep track of how many
      usages are in flight.  The percpu_counter is drained while freezing to
      ensure that no usage is left in-flight after freezing is complete.
      blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
      per-cpu gating mechanism.
      
      This type of code has a relatively high chance of subtle bugs which are
      extremely difficult to trigger and it's way too hairy to be open coded
      in blk-mq.  percpu_ref can serve the same purpose after the recent
      changes.  This patch replaces the open-coded per-cpu usage counting
      and draining mechanism with percpu_ref.
      
      blk_mq_queue_enter() performs tryget_live on the ref and exit()
      performs put.  blk_mq_freeze_queue() kills the ref and waits until the
      reference count reaches zero.  blk_mq_unfreeze_queue() revives the ref
      and wakes up the waiters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      add703fd
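The enter/exit/freeze/unfreeze protocol described in add703fd can be modelled roughly as follows. This is a single-threaded toy: a plain counter and a flag stand in for percpu_ref, there is no sleeping or waking, and every name is hypothetical.

```c
#include <stdbool.h>

/* Toy stand-in for a percpu_ref: a usage count plus a "dead" flag. */
struct ref_sketch {
    long count; /* usages currently in flight */
    bool dead;  /* true between kill and revive, i.e. while frozen */
};

/* Models queue enter: a tryget_live that fails once the ref is killed. */
static bool queue_enter(struct ref_sketch *r)
{
    if (r->dead)
        return false; /* real code would wait for the unfreeze here */
    r->count++;
    return true;
}

/* Models queue exit: drop one usage. */
static void queue_exit(struct ref_sketch *r)
{
    r->count--;
}

/* Models the kill step of freezing: no new usages may start. */
static void freeze_start(struct ref_sketch *r)
{
    r->dead = true;
}

/* Freezing is complete only once all in-flight usages have exited. */
static bool freeze_done(const struct ref_sketch *r)
{
    return r->dead && r->count == 0;
}

/* Models unfreezing: revive the ref so enter succeeds again. */
static void unfreeze(struct ref_sketch *r)
{
    r->dead = false;
}
```

The point of the model is the ordering: killing the ref stops new entries immediately, but the freeze only completes after the last holder calls exit.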
    • blk-mq: decouble blk-mq freezing from generic bypassing · 780db207
      Tejun Heo authored
      
      
      blk-mq freezing is entangled with generic bypassing which bypasses
      blkcg and io scheduler and lets IO requests fall through the block
      layer to the drivers in FIFO order.  This allows forward progress on
      IOs with the advanced features disabled so that those features can be
      configured or altered without worrying about stalling IO which may
      lead to deadlock through memory allocation.
      
      However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
      doesn't make use of blkcg or ioscheds and it maps bypassing to
      freezing, which blocks request processing and drains all the in-flight
      ones.  This causes problems as bypassing assumes that request
      processing is online.  blk-mq works around this by conditionally
      allowing request processing for the problem case - during queue
      initialization.
      
      Another weirdity is that except for during queue cleanup, bypassing
      started on the generic side prevents blk-mq from processing new
      requests but doesn't drain the in-flight ones.  This shouldn't break
      anything but again highlights that something isn't quite right here.
      
      The root cause is conflating blk-mq freezing and generic bypassing
      which are two different mechanisms.  The only intersecting purpose
      that they serve is during queue cleanup.  Let's properly separate
      blk-mq freezing from generic bypassing and simply use it where
      necessary.
      
      * request_queue->mq_freeze_depth is added and
        blk_mq_[un]freeze_queue() now operate on this counter instead of
        ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
        but the counter is tested directly.  This will be further updated by
        later changes.
      
      * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
        blk_mq_freeze_queue().  Queue cleanup path now calls
        blk_mq_freeze_queue() directly.
      
      * blk_queue_enter()'s fast path condition is simplified to simply
        check @q->mq_freeze_depth.  Previously, the condition was
      
      	!blk_queue_dying(q) &&
      	    (!blk_queue_bypass(q) || !blk_queue_init_done(q))
      
        mq_freeze_depth is incremented right after dying is set and
        blk_queue_init_done() exception isn't necessary as blk-mq doesn't
        start frozen, which only leaves the blk_queue_bypass() test which
        can be replaced by @q->mq_freeze_depth test.
      
      This change simplifies the code and reduces confusion in the area.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      780db207
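The fast-path simplification in 780db207 can be sketched by putting the old and new conditions side by side. The struct below is a hypothetical stand-in with plain fields instead of queue flags; only the old condition is transcribed from the commit message, the rest is an illustrative assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the queue state the two tests look at. */
struct queue_sketch {
    bool dying;
    bool bypass;
    bool init_done;
    int  mq_freeze_depth; /* number of nested freezes in effect */
};

/* Old fast-path test, as quoted in the commit message. */
static bool may_enter_old(const struct queue_sketch *q)
{
    return !q->dying && (!q->bypass || !q->init_done);
}

/* New fast-path test: entry is allowed only while nothing holds a freeze. */
static bool may_enter_new(const struct queue_sketch *q)
{
    return q->mq_freeze_depth == 0;
}

/* Freezes nest, so each freeze must be paired with an unfreeze. */
static void freeze(struct queue_sketch *q)
{
    q->mq_freeze_depth++;
}

static void unfreeze(struct queue_sketch *q)
{
    q->mq_freeze_depth--;
}
```

Because the depth is a counter rather than a flag, two overlapping freezes keep the queue closed until both have been undone.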
  8. 24 Jun, 2014 1 commit
    • block: add support for limiting gaps in SG lists · 66cb45aa
      Jens Axboe authored
      
      
      Another restriction inherited for NVMe - those devices don't support
      SG lists that have "gaps" in them. Gaps refer to cases where the
      previous SG entry doesn't end on a page boundary. For NVMe, all SG
      entries must start at offset 0 (except the first) and end on a page
      boundary (except the last).
      Signed-off-by: Jens Axboe <axboe@fb.com>
      66cb45aa
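The gap rule stated in 66cb45aa (every entry but the first starts at offset 0, every entry but the last ends on a page boundary) can be written as a small check. This is a sketch only: the entry layout, the function name, and the 4096-byte page size are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed page size for the illustration. */
#define PAGE_SIZE_SKETCH 4096u

/* Hypothetical SG entry: where the data starts in its page, and its length. */
struct sg_entry_sketch {
    unsigned int offset; /* byte offset within the first page */
    unsigned int len;    /* length in bytes */
};

/* Returns true if the list violates the "no gaps" rule. */
static bool sg_list_has_gap(const struct sg_entry_sketch *sg, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Every entry except the first must start at offset 0. */
        if (i > 0 && sg[i].offset != 0)
            return true;
        /* Every entry except the last must end on a page boundary. */
        if (i + 1 < n && (sg[i].offset + sg[i].len) % PAGE_SIZE_SKETCH != 0)
            return true;
    }
    return false;
}
```

A list such as (offset 512, len 3584), (0, 4096), (0, 100) passes, while a list whose first entry is (0, 1000) followed by another entry does not.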
  9. 18 Jun, 2014 1 commit
  10. 06 Jun, 2014 1 commit
    • block: add blk_rq_set_block_pc() · f27b087b
      Jens Axboe authored
      
      
      With the optimizations around not clearing the full request at alloc
      time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
      up to the user allocating the request.
      
      Add a blk_rq_set_block_pc() that sets the command type to
      REQ_TYPE_BLOCK_PC, and properly initializes the members associated
      with this type of request. Update callers to use this function instead
      of manipulating rq->cmd_type directly.
      
      Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
      attempt.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f27b087b
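The idea behind f27b087b, one helper that sets the type and initializes the fields that go with it, might look roughly like this in a userspace model. The struct and its fields are hypothetical; the commit message does not list which members the real helper touches, so the set shown here is an assumption.

```c
#include <string.h>

/* Hypothetical request types for the illustration. */
enum cmd_type_sketch { TYPE_FS = 1, TYPE_BLOCK_PC = 2 };

#define CMD_BUF_LEN 16

/* Hypothetical, heavily reduced request structure. */
struct request_sketch {
    enum cmd_type_sketch cmd_type;
    unsigned int data_len;
    unsigned char cmd_buf[CMD_BUF_LEN]; /* inline command buffer */
    unsigned char *cmd;                 /* points at the command bytes */
};

/*
 * Set the type and initialize the type-specific members in one place,
 * so callers no longer depend on the allocator having cleared them.
 */
static void rq_set_block_pc(struct request_sketch *rq)
{
    rq->cmd_type = TYPE_BLOCK_PC;
    rq->data_len = 0;
    memset(rq->cmd_buf, 0, sizeof(rq->cmd_buf));
    rq->cmd = rq->cmd_buf;
}
```

A caller that previously assigned the type field directly would call the helper instead, which also covers requests whose memory was not zeroed at allocation time.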
  11. 05 Jun, 2014 1 commit
    • block: add notion of a chunk size for request merging · 762380ad
      Jens Axboe authored
      
      
      Some drivers have different limits on what size a request should
      optimally be, depending on the offset of the request, similar to
      dividing a device into chunks. Add a setting that allows the driver
      to inform the block layer of such a chunk size. The block layer will
      then prevent merging across the chunks.
      
      This is needed to optimally support NVMe with a non-zero stripe size.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      762380ad
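The chunk rule in 762380ad can be sketched as a bound on how far a request may grow from its starting offset. The function names are hypothetical, and the sketch assumes the chunk size is a power of two expressed in sectors.

```c
/*
 * Sectors remaining between `offset` and the end of the chunk that
 * contains it. Assumes chunk_sectors is a non-zero power of two.
 */
static unsigned int sectors_to_chunk_end(unsigned long long offset,
                                         unsigned int chunk_sectors)
{
    return chunk_sectors - (unsigned int)(offset & (chunk_sectors - 1));
}

/*
 * A merge is acceptable only if the combined request, which starts at
 * `start`, still ends inside the same chunk. Returns 1 if it fits.
 */
static int merge_fits_chunk(unsigned long long start,
                            unsigned int cur_sectors,
                            unsigned int add_sectors,
                            unsigned int chunk_sectors)
{
    if (chunk_sectors == 0)
        return 1; /* no chunk size configured, nothing to enforce */
    return cur_sectors + add_sectors <=
           sectors_to_chunk_end(start, chunk_sectors);
}
```

For a 256-sector chunk and a request starting at sector 200, only 56 sectors remain in the chunk, so an 8-sector request can absorb 48 more sectors but not 49.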
  12. 04 Jun, 2014 3 commits
  13. 29 May, 2014 2 commits
    • block: add queue flag for disabling SG merging · 05f1dd53
      Jens Axboe authored
      
      
      If devices are not SG starved, we waste a lot of time potentially
      collapsing SG segments. Enough that 1.5% of the CPU time goes
      to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
      which just returns the number of vectors in a bio instead of looping
      over all segments and checking for collapsible ones.
      
      Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt in to SG
      merging, if they so desire.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      05f1dd53
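The trade-off in 05f1dd53 can be sketched as a segment counter with a fast path. The vector struct and function name are hypothetical, and the merge test is reduced to physical contiguity; real merging also has to respect limits such as the maximum segment size, which this sketch ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical I/O vector: physical start address and length in bytes. */
struct vec_sketch {
    unsigned long addr;
    unsigned int len;
};

/*
 * Count the SG segments needed for n vectors. With no_sg_merge set, the
 * answer is simply n and the loop is skipped; otherwise adjacent vectors
 * that are physically contiguous collapse into one segment.
 */
static size_t count_segments(const struct vec_sketch *v, size_t n,
                             bool no_sg_merge)
{
    if (no_sg_merge || n == 0)
        return n;

    size_t segs = 1;
    for (size_t i = 1; i < n; i++) {
        if (v[i - 1].addr + v[i - 1].len != v[i].addr)
            segs++;
    }
    return segs;
}
```

The flag trades a possibly longer SG list for skipping the per-vector loop, which is the cost the commit message measures.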
    • block: remove 'magic' from struct blk_plug · 4d92a9be
      Jens Axboe authored
      
      
      I don't think we've ever caught any bugs with this, and there's the
      list poisoning for the plug lists to catch uninitialized cases.
      So remove the magic member and save 8 bytes in the struct.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4d92a9be
  14. 28 May, 2014 1 commit
  15. 13 May, 2014 1 commit
    • blk-mq: improve support for shared tags maps · 0d2602ca
      Jens Axboe authored
      
      
      This adds support for active queue tracking, meaning that the
      blk-mq tagging maintains a count of active users of a tag set.
      This allows us to maintain a notion of fairness between users,
      so that we can distribute the tag depth evenly without starving
      some users while allowing others to try unfairly deep queues.
      
      If sharing of a tag set is detected, each hardware queue will
      track the depth of its own queue. And if this exceeds the total
      depth divided by the number of active queues, the user is actively
      throttled down.
      
      The active queue count is done lazily to avoid bouncing that data
      between submitter and completer. Each hardware queue gets marked
      active when it allocates its first tag, and gets marked inactive
      when 1) the last tag is cleared, and 2) the queue timeout grace
      period has passed.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0d2602ca
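The throttling rule in 0d2602ca, where a queue is held back once its own depth reaches the total depth divided by the number of active queues, can be sketched as a single predicate. The function name and the round-up division are assumptions; the commit message does not say how the share is rounded.

```c
#include <stdbool.h>

/*
 * Decide whether a hardware queue may take another tag from a shared set.
 *   total_depth   - number of tags in the shared set
 *   active_queues - queues currently marked active on the set
 *   nr_active     - tags this queue already holds
 */
static bool may_queue_sketch(unsigned int total_depth,
                             unsigned int active_queues,
                             unsigned int nr_active)
{
    if (active_queues == 0)
        return true; /* nobody is sharing yet, so nothing to divide */

    /* Fair share per active queue, rounded up (an assumption). */
    unsigned int share = (total_depth + active_queues - 1) / active_queues;

    return nr_active < share;
}
```

With 64 tags and 4 active queues each queue gets a share of 16, so a queue holding 15 tags may allocate one more and a queue holding 16 is throttled.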
  16. 09 May, 2014 1 commit
  17. 16 Apr, 2014 3 commits
  18. 15 Apr, 2014 2 commits
  19. 10 Apr, 2014 1 commit
    • block: fix regression with block enabled tagging · 360f92c2
      Jens Axboe authored
      Martin reported that his test system would not boot with
      current git; it oopsed with this:
      
      BUG: unable to handle kernel paging request at ffff88046c6c9e80
      IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
      PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
      Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
      rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
      CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
      Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
      Workqueue: events_unbound async_run_entry_fn
      task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
      RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
      blk_queue_start_tag+0x90/0x150
      RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
      RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
      RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
      RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
      R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
      R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
      FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
      Stack:
       ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
       ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
       ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
      Call Trace:
       [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
       [<ffffffff81293167>] __blk_run_queue+0x37/0x50
       [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
       [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
       [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
       [<ffffffff813bba35>] scsi_execute+0xe5/0x180
       [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
       [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
       [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
       [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
       [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
       [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
       [<ffffffff8106a1c5>] process_one_work+0x175/0x410
       [<ffffffff8106b373>] worker_thread+0x123/0x400
       [<ffffffff8106b250>] ? manage_workers+0x160/0x160
       [<ffffffff8107104e>] kthread+0xce/0xf0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
      Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
      df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
      58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
      RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
       RSP <ffff880273d03a58>
      CR2: ffff88046c6c9e80
      
      Martin bisected and found this to be the problem patch:
      
      	commit 6d113398
      	Author: Jan Kara <jack@suse.cz>
      	Date:   Mon Feb 24 16:39:54 2014 +0100
      
      	    block: Stop abusing rq->csd.list in blk-softirq
      
      and the problem was immediately apparent. The patch states that
      it is safe to reuse queuelist at completion time, since it is
      no longer used. However, that is not true if a device is using
      block enabled tagging. If that is the case, then the queuelist
      is reused to keep track of busy tags. If a device also ended
      up using softirq completions, we'd reuse ->queuelist for the
      IPI handling while block tagging was still using it. Boom.
      
      Fix this by adding a new ipi_list list head, and share the
      memory used with the request hash table. The hash table is
      never used after the request is moved to the dispatch list,
      which happens long before any potential completion of the
      request. Add a new request bit for this, so we don't have
      cases that check rq->hash while it could potentially have
      been reused for the IPI completion.
      Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
      Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      360f92c2
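The fix in 360f92c2, a dedicated IPI list head that shares storage with the hash linkage and is guarded by a request bit, can be sketched as follows. All type and function names here, including the flag name, are hypothetical; the commit message only says that a new bit and a new list head were added.

```c
#include <stdbool.h>

/* Minimal doubly linked list node for the illustration. */
struct list_node_sketch {
    struct list_node_sketch *next, *prev;
};

/* Hypothetical "request is currently in the hash" bit. */
#define RQ_HASHED_SKETCH (1u << 0)

struct request_sketch {
    /* Left alone at completion time: block tagging may still use it. */
    struct list_node_sketch queuelist;
    union {
        struct list_node_sketch hash;     /* used until dispatch */
        struct list_node_sketch ipi_list; /* used at completion */
    } u;
    unsigned int flags;
};

static void rq_add_hash(struct request_sketch *rq)
{
    rq->flags |= RQ_HASHED_SKETCH;
}

static void rq_del_hash(struct request_sketch *rq)
{
    rq->flags &= ~RQ_HASHED_SKETCH;
}

/*
 * Queue a request for IPI completion on a circular list. Refused while
 * the request is still hashed, because the two uses share storage.
 */
static bool rq_queue_ipi(struct request_sketch *rq,
                         struct list_node_sketch *head)
{
    if (rq->flags & RQ_HASHED_SKETCH)
        return false;
    rq->u.ipi_list.next = head->next;
    rq->u.ipi_list.prev = head;
    head->next->prev = &rq->u.ipi_list;
    head->next = &rq->u.ipi_list;
    return true;
}
```

The union keeps the request the same size, and the flag makes the ownership of the shared storage explicit instead of relying on call ordering alone.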
  20. 09 Apr, 2014 2 commits
  21. 02 Apr, 2014 1 commit
  22. 24 Feb, 2014 1 commit
  23. 10 Feb, 2014 1 commit
    • blk-mq: rework flush sequencing logic · 18741986
      Christoph Hellwig authored
      
      
      Switch to using a preallocated flush_rq for blk-mq similar to what's done
      with the old request path.  This allows us to set up the request properly
      with a tag from the actually allowed range and ->rq_disk as needed by
      some drivers.  To make life easier we also switch to dynamic allocation
      of ->flush_rq for the old path.
      
      This effectively reverts most of
      
          "blk-mq: fix for flush deadlock"
      
      and
      
          "blk-mq: Don't reserve a tag for flush request"
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      18741986
  24. 31 Jan, 2014 1 commit
  25. 08 Jan, 2014 1 commit
    • bcache/md: Use raid stripe size · c78afc62
      Kent Overstreet authored
      
      
      Now that we've got code for raid5/6 stripe awareness, bcache just needs
      to know about the stripes and when writing partial stripes is expensive
      - we probably don't want to enable this optimization for raid1 or 10,
      even though they have stripes. So add a flag to queue_limits.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      c78afc62