1. 19 Sep, 2018 1 commit
    • blk-mq: fix updating tags depth · abe0bde4
      Ming Lei authored
      [ Upstream commit 75d6e175 ]
      
      The 'nr' passed in from userspace represents the total depth.
      Meanwhile, inside 'struct blk_mq_tags', 'nr_tags' stores the total tag
      depth and 'nr_reserved_tags' stores the reserved part.

      There are two issues in blk_mq_tag_update_depth():

      1) when growing tags, the passed 'nr' should be used, and the number
      of reserved tags should be kept unchanged.

      2) the passed 'nr' should be checked against 'tags->nr_tags', not
      against the number of normal (non-reserved) tags.

      This patch fixes both cases and avoids a kernel crash caused by
      resizing the sbitmap queue with the wrong depth.
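
      A rough sketch of the corrected logic (not the verbatim patch; helper
      and field names as used in blk-mq at the time):

      int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
                                  struct blk_mq_tags **tagsptr,
                                  unsigned int tdepth, bool can_grow)
      {
              struct blk_mq_tags *tags = *tagsptr;

              if (tdepth <= tags->nr_reserved_tags)
                      return -EINVAL;

              /* compare the new total depth against the current total depth */
              if (tdepth > tags->nr_tags) {
                      if (!can_grow)
                              return -EINVAL;
                      /*
                       * Grow: allocate a replacement tag set with 'tdepth'
                       * total tags, keeping nr_reserved_tags unchanged, then
                       * swap it in and free the old one.
                       */
              } else {
                      /* only the normal part of the sbitmap is resized */
                      sbitmap_queue_resize(&tags->bitmap_tags,
                                           tdepth - tags->nr_reserved_tags);
              }
              return 0;
      }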
      
      Cc: "Ewan D. Milne" <emilne@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Tested-by: Marco Patalano <mpatalan@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 15 Sep, 2018 1 commit
  3. 02 Aug, 2018 1 commit
    • blk-mq: fix blk_mq_tagset_busy_iter · 2d5ba0e2
      Ming Lei authored
      Commit d250bf4e ("blk-mq: only iterate over inflight requests
      in blk_mq_tagset_busy_iter") replaced 'blk_mq_request_started(req)'
      with 'blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT'. That is wrong and
      causes lots of test systems to hang during boot.
      
      Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter().
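
      Roughly what the iterator looks like with the fix applied (a sketch of
      bt_tags_iter() as it stood at the time, not the exact diff):

      static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
                               void *data)
      {
              struct bt_tags_iter_data *iter_data = data;
              struct blk_mq_tags *tags = iter_data->tags;
              bool reserved = iter_data->reserved;
              struct request *rq;

              if (!reserved)
                      bitnr += tags->nr_reserved_tags;
              rq = tags->rqs[bitnr];

              /* back to the "started" test instead of MQ_RQ_IN_FLIGHT */
              if (rq && blk_mq_request_started(rq))
                      iter_data->fn(rq, iter_data->data, reserved);
              return true;
      }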
      
      Fixes: d250bf4e ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter")
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Matt Hart <matthew.hart@linaro.org>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: John Garry <john.garry@huawei.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Reported-by: Mark Brown <broonie@kernel.org>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 14 Jun, 2018 1 commit
  5. 30 May, 2018 1 commit
  6. 24 May, 2018 1 commit
    • blk-mq: avoid starving tag allocation after allocating process migrates · e6fc4649
      Ming Lei authored
      When the allocating process is scheduled back and its mapped hw queue
      has changed, fake one extra wake-up on the previous queue to
      compensate for the missed wake-up, so other allocations waiting on the
      previous queue won't be starved.

      This patch fixes a request allocation hang which is easy to trigger
      when nr_requests is very low.

      The race is as follows (a sketch of the compensating wake-up follows
      the list):

      1) 2 hw queues, nr_requests is 2, and wake_batch is 1

      2) there are 3 waiters on hw queue 0

      3) the two in-flight requests on hw queue 0 complete, and because of
         wake_batch only two of the three waiters are woken up; both of them
         can then be scheduled onto another CPU and switch to hw queue 1

      4) the 3rd waiter then waits forever, since no in-flight request is
         left on hw queue 0.

      5) this patch fixes it with the fake wake-up issued when a waiter is
         scheduled to another hw queue
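
      A condensed sketch of the idea in blk_mq_get_tag() (helper names as
      used by this series; details may differ from the final patch):

      do {
              prepare_to_wait_exclusive(&ws->wait, &wait,
                                        TASK_UNINTERRUPTIBLE);

              tag = __blk_mq_get_tag(data, bt);
              if (tag != -1)
                      break;

              bt_prev = bt;
              io_schedule();

              /* we may have been migrated and remapped to another hctx */
              data->ctx = blk_mq_get_ctx(data->q);
              data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
              tags = blk_mq_tags_from_data(data);
              bt = &tags->bitmap_tags;    /* breserved_tags for reserved allocs */

              /*
               * If the hw queue changed, fake one wake-up on the previous
               * queue to compensate for the wake-up this waiter consumed
               * there, so waiters left on that queue are not starved.
               */
              if (bt != bt_prev)
                      sbitmap_queue_wake_up(bt_prev);

              ws = bt_wait_ptr(bt, data->hctx);
      } while (1);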
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
      Modified commit message to make it clearer, and make it apply on
      top of the 4.18 branch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 22 Dec, 2017 1 commit
    • blk-mq: improve heavily contended tag case · 4e5dff41
      Jens Axboe authored
      Even with a number of waitqueues, we can get into a situation where we
      are heavily contended on the waitqueue lock. I got a report on spc1
      where we're spending seconds doing this. Arguably the use case is nasty;
      I reproduce it with one device and 1000 threads banging on the device.
      But that doesn't mean we shouldn't be handling it better.
      
      What ends up happening is that a thread will fail to get a tag, add
      itself to the waitqueue, and subsequently get woken up when a tag is
      freed - only to find itself going back to sleep on the waitqueue.
      
      Instead of waking all threads, use an exclusive wait and wake up only
      our sbitmap wake batch count of waiters. This seems to work well for
      me (a massive improvement for this use case), and it survives basic
      testing, but I haven't fully verified it yet.
      
      An additional improvement is running the queue and checking for a new
      tag BEFORE needing to add ourselves to the waitqueue.
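
      A sketch of both changes in the tag wait loop (condensed; names as in
      blk-mq-tag.c of that era, with the matching sbitmap change waking up
      to the wake_batch count of exclusive waiters at a time):

      do {
              /* exclusive: a batch of freed tags wakes only that many waiters */
              prepare_to_wait_exclusive(&ws->wait, &wait,
                                        TASK_UNINTERRUPTIBLE);

              tag = __blk_mq_get_tag(data, bt);
              if (tag != -1)
                      break;

              /* kick pending submits, then retry before actually sleeping */
              blk_mq_run_hw_queue(data->hctx, false);
              tag = __blk_mq_get_tag(data, bt);
              if (tag != -1)
                      break;

              io_schedule();
      } while (1);
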
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 18 Oct, 2017 2 commits
  9. 18 Aug, 2017 1 commit
  10. 09 Aug, 2017 1 commit
  11. 14 Apr, 2017 1 commit
  12. 13 Mar, 2017 1 commit
  13. 02 Mar, 2017 1 commit
  14. 27 Jan, 2017 2 commits
  15. 25 Jan, 2017 1 commit
  16. 20 Jan, 2017 1 commit
    • blk-mq: allow resize of scheduler requests · 70f36b60
      Jens Axboe authored
      Add support for growing the tags associated with a hardware queue, for
      the scheduler tags. Currently we only support resizing within the
      limits of the original depth; change that so we can grow it as well,
      by allocating and replacing the existing scheduler tag set.
      
      This is similar to how we could increase the software queue depth with
      the legacy IO stack and schedulers.
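
      Roughly, the grow path in blk_mq_tag_update_depth() allocates a larger
      scheduler tag map and swaps it in (a sketch using the internal blk-mq
      helpers of the time, not the exact diff):

      if (tdepth > tags->nr_tags) {
              struct blk_mq_tag_set *set = hctx->queue->tag_set;
              struct blk_mq_tags *new;

              if (!can_grow)
                      return -EINVAL;

              new = blk_mq_alloc_rq_map(set, hctx->queue_num, tdepth,
                                        tags->nr_reserved_tags);
              if (!new)
                      return -ENOMEM;
              if (blk_mq_alloc_rqs(set, new, hctx->queue_num, tdepth)) {
                      blk_mq_free_rq_map(new);
                      return -ENOMEM;
              }

              blk_mq_free_rqs(set, *tagsptr, hctx->queue_num);
              blk_mq_free_rq_map(*tagsptr);
              *tagsptr = new;
      }
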
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
  17. 19 Jan, 2017 1 commit
  18. 17 Jan, 2017 2 commits
  19. 17 Sep, 2016 4 commits
  20. 15 Sep, 2016 2 commits
  21. 08 Jul, 2016 1 commit
  22. 12 Apr, 2016 2 commits
  23. 01 Dec, 2015 1 commit
  24. 07 Nov, 2015 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking whether they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible (see the sketch below).
        This is because checking for __GFP_WAIT, as was done historically, can
        now trigger false positives. Some exceptions like dm-crypt.c exist,
        where the code intent is clearer if __GFP_DIRECT_RECLAIM is used
        instead of the helper due to flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
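
      A condensed sketch of the new check callers should use (the helper is
      paraphrased from include/linux/gfp.h; exact definitions may differ
      slightly):

      static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
      {
              return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
      }

      /* old pattern, now prone to false positives: */
      if (!(gfp_mask & __GFP_WAIT))
              return NULL;

      /* new pattern: */
      if (!gfpflags_allow_blocking(gfp_mask))
              return NULL;
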
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 15 Oct, 2015 1 commit
  26. 09 Oct, 2015 1 commit
    • blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c · 8ee1b7b9
      Kosuke Tatsukawa authored
      blk_mq_tag_update_depth() seems to be missing a memory barrier which
      might cause the waker to not notice the waiter and fail to send a
      wake_up as in the following figure.
      
      	blk_mq_tag_update_depth			bt_get
      ------------------------------------------------------------------------
      if (waitqueue_active(&bs->wait))
      /* The CPU might reorder the test for
         the waitqueue up here, before
         prior writes complete */
      					prepare_to_wait(&bs->wait, &wait,
      					  TASK_UNINTERRUPTIBLE);
      					tag = __bt_get(hctx, bt, last_tag,
      					  tags);
      					/* Value set in bt_update_count not
      					   visible yet */
      bt_update_count(&tags->bitmap_tags, tdepth);
      /* blk_mq_tag_wakeup_all(tags, false); */
       bt = &tags->bitmap_tags;
       wake_index = atomic_read(&bt->wake_index);
      					...
      					io_schedule();
      ------------------------------------------------------------------------
      
      This patch adds the missing memory barrier.
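
      The general shape of the fix on the waker side (a sketch reusing the
      names from the figure above; the exact barrier placement in the patch
      may differ):

      bt_update_count(&tags->bitmap_tags, tdepth);
      /* make the update visible before testing for waiters; pairs with the
         barrier in the waiter's prepare_to_wait() */
      smp_mb();
      if (waitqueue_active(&bs->wait))
              wake_up(&bs->wait);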
      
      I found this issue while looking through the Linux source code for
      places that call waitqueue_active() before wake_up*() without a
      preceding memory barrier, after sending a patch to fix a similar
      issue in drivers/tty/n_tty.c.  (Details about the original issue can
      be found here: https://lkml.org/lkml/2015/9/28/849).
      Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  27. 01 Oct, 2015 1 commit
  28. 15 Aug, 2015 1 commit
    • blk-mq: fix race between timeout and freeing request · 0048b483
      Ming Lei authored
      Inside the timeout handler, blk_mq_tag_to_rq() is called to retrieve
      the request from a tag. This is obviously wrong because the request
      can be freed at any time and some fields of the request can't be
      trusted, so a kernel oops may be triggered[1].

      Currently, wrt. blk_mq_tag_to_rq(), the only special case is that a
      flush request can share its tag with the request it was cloned from,
      and the two requests can't be active at the same time. This patch
      fixes the above issue by keeping tags->rqs[tag] pointing at the active
      request (either the flush rq or the request it was cloned from) for
      that tag.

      Also, blk_mq_tag_to_rq() gets much simplified by this patch.

      Since blk_mq_tag_to_rq() is mainly for drivers and the caller must
      make sure the request can't be freed, in bt_for_each() this helper is
      replaced with a direct tags->rqs[tag] lookup.
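
      Roughly what the simplified lookup looks like after this change
      (tags->rqs[tag] always points at whichever request is currently
      active for the tag):

      struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags,
                                       unsigned int tag)
      {
              if (tag < tags->nr_tags)
                      return tags->rqs[tag];

              return NULL;
      }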
      
      [1] kernel oops log
      [  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158
      [  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0
      [  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [  439.700653] Dumping ftrace buffer:
      [  439.700653]    (ftrace buffer empty)
      [  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
      [  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265
      [  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000
      [  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283
      [  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002
      [  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
      [  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000
      [  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202
      [  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00
      [  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000
      [  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0
      [  439.730500] Stack:
      [  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104
      [  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80
      [  439.755663] Call Trace:
      [  439.755663]  <IRQ>
      [  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a
      [  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38
      [  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4
      [  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284
      [  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38
      [  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8
      [  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4
      [  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94
      [  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e
      [  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90
      [  439.755663]  <EOI>
      [  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a
      [  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163
      [  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163
      [  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd
      [  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a
      [  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a
      [  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47
      [  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f
      [  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74
      [  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6
      [  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f
      [  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
      f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
      53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
      [  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.790911]  RSP <ffff880819203da0>
      [  439.790911] CR2: 0000000000000158
      [  439.790911] ---[ end trace d40af58949325661 ]---
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  29. 01 Jun, 2015 1 commit
    • blk-mq: Shared tag enhancements · f26cdc85
      Keith Busch authored
      Storage controllers may expose multiple block devices that share hardware
      resources managed by blk-mq. This patch enhances the shared tags so a
      low-level driver can access the shared resources not tied to the unshared
      h/w contexts. This way, the LLD can dynamically add and delete disks
      and request queues without having to track all of the request_queue
      hctxs in order to iterate outstanding tags.
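
      A usage sketch of the tagset-wide iterator this adds (the driver and
      callback names here are made up for illustration; only the iterator
      and the callback signature follow the patch):

      /* hypothetical LLD callback, invoked once per outstanding request */
      static void lld_cancel_rq(struct request *rq, void *data, bool reserved)
      {
              struct lld_ctrl *ctrl = data;

              lld_abort_command(ctrl, rq);    /* hypothetical helper */
      }

      /* walk every busy tag across the shared tag set, no per-queue hctx
         tracking needed */
      blk_mq_tagset_busy_iter(&ctrl->tag_set, lld_cancel_rq, ctrl);
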
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  30. 18 Mar, 2015 1 commit
  31. 11 Feb, 2015 1 commit
  32. 23 Jan, 2015 1 commit
    • blk-mq: add tag allocation policy · 24391c0d
      Shaohua Li authored
      This is the blk-mq part to support tag allocation policy. The default
      allocation policy isn't changed (though it's not a strict FIFO). The new
      policy is round-robin for libata. But it's a best-effort
      implementation: if multiple tasks are competing, the tags returned
      will be mixed (which is unavoidable even with !mq, as requests from
      different tasks can be mixed in the queue).
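
      A condensed sketch of the driver-visible part (enum and macro names as
      this series adds them; the exact flag encoding may differ):

      /* include/linux/blkdev.h */
      enum {
              BLK_TAG_ALLOC_FIFO,     /* allocate starting from 0 */
              BLK_TAG_ALLOC_RR,       /* allocate starting from last allocated tag */
      };

      /* a blk-mq driver asking for round-robin tags encodes the policy into
         its tag_set flags, roughly: */
      set->flags |= BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_RR);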
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>