1. 15 Nov, 2017 3 commits
    • Luca Miccio's avatar
      block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP · a33801e8
      Luca Miccio authored
      BFQ currently creates, and updates, its own instance of the whole
      set of blkio statistics that cfq creates. Yet, from the comments
      of Tejun Heo in [1], it turned out that most of these statistics
      are meant/useful only for debugging. This commit makes BFQ create
      the latter, debugging statistics only if the option
      By doing so, this commit also enables BFQ to enjoy a high perfomance
      boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
      BFQ has to update far fewer statistics, and, in particular, not the
      heaviest to update.  To give an idea of the benefits, if
      CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
      with 8 threads doing random I/O in parallel on null_blk (configured
      with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
      (+30%). We have measured similar or even much higher boosts with other
      CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
      been obtained and can be reproduced very easily with the script in [1].
      [1] https://www.spinics.net/lists/linux-block/msg18943.htmlSuggested-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Tested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarLuca Miccio <lucmiccio@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Paolo Valente's avatar
      block, bfq: update blkio stats outside the scheduler lock · 24bfd19b
      Paolo Valente authored
      bfq invokes various blkg_*stats_* functions to update the statistics
      contained in the special files blkio.bfq.* in the blkio controller
      groups, i.e., the I/O accounting related to the proportional-share
      policy provided by bfq. The execution of these functions takes a
      considerable percentage, about 40%, of the total per-request execution
      time of bfq (i.e., of the sum of the execution time of all the bfq
      functions that have to be executed to process an I/O request from its
      creation to its destruction).  This reduces the request-processing
      rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
      the bfq functions that invoke blkg_*stats_* functions cannot be
      executed in parallel with the rest of the code of bfq, because both
      are executed under the same same per-device scheduler lock.
      To reduce this slowdown, this commit moves, wherever possible, the
      invocation of these functions (more precisely, of the bfq functions
      that invoke blkg_*stats_* functions) outside the critical sections
      protected by the scheduler lock.
      With this change, and with all blkio.bfq.* statistics enabled, the
      throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
      i7-4850HQ, in case of 8 threads doing random I/O in parallel on
      null_blk, with the latter configured with 0 latency. We obtained the
      same or higher throughput boosts, up to +30%, with other processors
      (some figures are reported in the documentation). For our tests, we
      used the script [1], with which our results can be easily reproduced.
      NOTE. This commit still protects the invocation of blkg_*stats_*
      functions with the request_queue lock, because the group these
      functions are invoked on may otherwise disappear before or while these
      functions are executed.  Fortunately, tests without even this lock
      show, by difference, that the serialization caused by this lock has a
      little impact (at most ~5% of throughput reduction).
      [1] https://github.com/Algodev-github/IOSpeedTested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarLuca Miccio <lucmiccio@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Paolo Valente's avatar
      doc, block, bfq: update max IOPS sustainable with BFQ · 68017e5d
      Paolo Valente authored
      We have investigated more deeply the performance of BFQ, in terms of
      number of IOPS that can be processed by the CPU when BFQ is used as
      I/O scheduler. In more detail, using the script [1], we have measured
      the number of IOPS reached on top of a null block device configured
      with zero latency, as a function of the workload (sequential read,
      sequential write, random read, random write) and of the system (we
      considered desktops, laptops and embedded systems).
      Basing on the resulting figures, with this commit we update the
      current, conservative IOPS range reported in BFQ documentation. In
      particular, the documentation now reports, for each of three different
      systems, the lowest number of IOPS obtained for that system with the
      above test (namely, the value obtained with the workload leading to
      the lowest IOPS).
      [1] https://github.com/Algodev-github/IOSpeedReviewed-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarLuca Miccio <lucmiccio@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  2. 11 Nov, 2017 2 commits
  3. 07 Nov, 2017 2 commits
  4. 13 Oct, 2017 2 commits
  5. 31 Aug, 2017 2 commits
  6. 03 Jul, 2017 1 commit
  7. 18 Jun, 2017 1 commit
  8. 10 May, 2017 1 commit
    • Paolo Valente's avatar
      block, bfq: stress that low_latency must be off to get max throughput · 43c1b3d6
      Paolo Valente authored
      The introduction of the BFQ and Kyber I/O schedulers has triggered a
      new wave of I/O benchmarks. Unfortunately, comments and discussions on
      these benchmarks confirm that there is still little awareness that it
      is very hard to achieve, at the same time, a low latency and a high
      throughput. In particular, virtually all benchmarks measure
      throughput, or throughput-related figures of merit, but, for BFQ, they
      use the scheduler in its default configuration. This configuration is
      geared, instead, toward a low latency. This is evidently a sign that
      BFQ documentation is still too unclear on this important aspect. This
      commit addresses this issue by stressing how BFQ configuration must be
      (easily) changed if the only goal is maximum throughput.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  9. 19 Apr, 2017 3 commits
    • Paolo Valente's avatar
      block, bfq: improve responsiveness · 44e44a1b
      Paolo Valente authored
      This patch introduces a simple heuristic to load applications quickly,
      and to perform the I/O requested by interactive applications just as
      quickly. To this purpose, both a newly-created queue and a queue
      associated with an interactive application (we explain in a moment how
      BFQ decides whether the associated application is interactive),
      receive the following two special treatments:
      1) The weight of the queue is raised.
      2) The queue unconditionally enjoys device idling when it empties; in
      fact, if the requests of a queue are sync, then performing device
      idling for the queue is a necessary condition to guarantee that the
      queue receives a fraction of the throughput proportional to its weight
      (see [1] for details).
      For brevity, we call just weight-raising the combination of these
      two preferential treatments. For a newly-created queue,
      weight-raising starts immediately and lasts for a time interval that:
      1) depends on the device speed and type (rotational or
      non-rotational), and 2) is equal to the time needed to load (start up)
      a large-size application on that device, with cold caches and with no
      additional workload.
      Finally, as for guaranteeing a fast execution to interactive,
      I/O-related tasks (such as opening a file), consider that any
      interactive application blocks and waits for user input both after
      starting up and after executing some task. After a while, the user may
      trigger new operations, after which the application stops again, and
      so on. Accordingly, the low-latency heuristic weight-raises again a
      queue in case it becomes backlogged after being idle for a
      sufficiently long (configurable) time. The weight-raising then lasts
      for the same time as for a just-created queue.
      According to our experiments, the combination of this low-latency
      heuristic and of the improvements described in the previous patch
      allows BFQ to guarantee a high application responsiveness.
      [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
          Scheduler", Proceedings of the First Workshop on Mobile System
          Technologies (MST-2015), May 2015.
          http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdfSigned-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Arianna Avanzini's avatar
      block, bfq: add full hierarchical scheduling and cgroups support · e21b7a0b
      Arianna Avanzini authored
      Add complete support for full hierarchical scheduling, with a cgroups
      interface. Full hierarchical scheduling is implemented through the
      'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
      associated with processes, and groups are represented in general by
      entities. Given the bfq_queues associated with the processes belonging
      to a given group, the entities representing these queues are sons of
      the entity representing the group. At higher levels, if a group, say
      G, contains other groups, then the entity representing G is the parent
      entity of the entities representing the groups in G.
      Hierarchical scheduling is performed as follows: if the timestamps of
      a leaf entity (i.e., of a bfq_queue) change, and such a change lets
      the entity become the next-to-serve entity for its parent entity, then
      the timestamps of the parent entity are recomputed as a function of
      the budget of its new next-to-serve leaf entity. If the parent entity
      belongs, in its turn, to a group, and its new timestamps let it become
      the next-to-serve for its parent entity, then the timestamps of the
      latter parent entity are recomputed as well, and so on. When a new
      bfq_queue must be set in service, the reverse path is followed: the
      next-to-serve highest-level entity is chosen, then its next-to-serve
      child entity, and so on, until the next-to-serve leaf entity is
      reached, and the bfq_queue that this entity represents is set in
      Writeback is accounted for on a per-group basis, i.e., for each group,
      the async I/O requests of the processes of the group are enqueued in a
      distinct bfq_queue, and the entity associated with this queue is a
      child of the entity associated with the group.
      Weights can be assigned explicitly to groups and processes through the
      cgroups interface, differently from what happens, for single
      processes, if the cgroups interface is not used (as explained in the
      description of the previous patch). In particular, since each node has
      a full scheduler, each group can be assigned its own weight.
      Signed-off-by: default avatarFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Paolo Valente's avatar
      block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler · aee69d78
      Paolo Valente authored
      We tag as v0 the version of BFQ containing only BFQ's engine plus
      hierarchical support. BFQ's engine is introduced by this commit, while
      hierarchical support is added by next commit. We use the v0 tag to
      distinguish this minimal version of BFQ from the versions containing
      also the features and the improvements added by next commits. BFQ-v0
      coincides with the version of BFQ submitted a few years ago [1], apart
      from the introduction of preemption, described below.
      BFQ is a proportional-share I/O scheduler, whose general structure,
      plus a lot of code, are borrowed from CFQ.
      - Each process doing I/O on a device is associated with a weight and a
      - BFQ grants exclusive access to the device, for a while, to one queue
        (process) at a time, and implements this service model by
        associating every queue with a budget, measured in number of
        - After a queue is granted access to the device, the budget of the
          queue is decremented, on each request dispatch, by the size of the
        - The in-service queue is expired, i.e., its service is suspended,
          only if one of the following events occurs: 1) the queue finishes
          its budget, 2) the queue empties, 3) a "budget timeout" fires.
          - The budget timeout prevents processes doing random I/O from
            holding the device for too long and dramatically reducing
          - Actually, as in CFQ, a queue associated with a process issuing
            sync requests may not be expired immediately when it empties. In
            contrast, BFQ may idle the device for a short time interval,
            giving the process the chance to go on being served if it issues
            a new request in time. Device idling typically boosts the
            throughput on rotational devices, if processes do synchronous
            and sequential I/O. In addition, under BFQ, device idling is
            also instrumental in guaranteeing the desired throughput
            fraction to processes issuing sync requests (see [2] for
            - With respect to idling for service guarantees, if several
              processes are competing for the device at the same time, but
              all processes (and groups, after the following commit) have
              the same weight, then BFQ guarantees the expected throughput
              distribution without ever idling the device. Throughput is
              thus as high as possible in this common scenario.
        - Queues are scheduled according to a variant of WF2Q+, named
          B-WF2Q+, and implemented using an augmented rb-tree to preserve an
          O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
          also ready for hierarchical scheduling. However, for a cleaner
          logical breakdown, the code that enables and completes
          hierarchical support is provided in the next commit, which focuses
          exactly on this feature.
        - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
          perfectly fair, and smooth service. In particular, B-WF2Q+
          guarantees that each queue receives a fraction of the device
          throughput proportional to its weight, even if the throughput
          fluctuates, and regardless of: the device parameters, the current
          workload and the budgets assigned to the queue.
        - The last, budget-independence, property (although probably
          counterintuitive in the first place) is definitely beneficial, for
          the following reasons:
          - First, with any proportional-share scheduler, the maximum
            deviation with respect to an ideal service is proportional to
            the maximum budget (slice) assigned to queues. As a consequence,
            BFQ can keep this deviation tight not only because of the
            accurate service of B-WF2Q+, but also because BFQ *does not*
            need to assign a larger budget to a queue to let the queue
            receive a higher fraction of the device throughput.
          - Second, BFQ is free to choose, for every process (queue), the
            budget that best fits the needs of the process, or best
            leverages the I/O pattern of the process. In particular, BFQ
            updates queue budgets with a simple feedback-loop algorithm that
            allows a high throughput to be achieved, while still providing
            tight latency guarantees to time-sensitive applications. When
            the in-service queue expires, this algorithm computes the next
            budget of the queue so as to:
            - Let large budgets be eventually assigned to the queues
              associated with I/O-bound applications performing sequential
              I/O: in fact, the longer these applications are served once
              got access to the device, the higher the throughput is.
            - Let small budgets be eventually assigned to the queues
              associated with time-sensitive applications (which typically
              perform sporadic and short I/O), because, the smaller the
              budget assigned to a queue waiting for service is, the sooner
              B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
      - Weights can be assigned to processes only indirectly, through I/O
        priorities, and according to the relation:
        weight = 10 * (IOPRIO_BE_NR - ioprio).
        The next patch provides, instead, a cgroups interface through which
        weights can be assigned explicitly.
      - If several processes are competing for the device at the same time,
        but all processes and groups have the same weight, then BFQ
        guarantees the expected throughput distribution without ever idling
        the device. It uses preemption instead. Throughput is then much
        higher in this common scenario.
      - ioprio classes are served in strict priority order, i.e.,
        lower-priority queues are not served as long as there are
        higher-priority queues.  Among queues in the same class, the
        bandwidth is distributed in proportion to the weight of each
        queue. A very thin extra bandwidth is however guaranteed to the Idle
        class, to prevent it from starving.
      - If the strict_guarantees parameter is set (default: unset), then BFQ
           - always performs idling when the in-service queue becomes empty;
           - forces the device to serve one I/O request at a time, by
             dispatching a new request only if there is no outstanding
        In the presence of differentiated weights or I/O-request sizes,
        both the above conditions are needed to guarantee that every
        queue receives its allotted share of the bandwidth (see
        Documentation/block/bfq-iosched.txt for more details). Setting
        strict_guarantees may evidently affect throughput.
      [1] https://lkml.org/lkml/2008/4/1/234
      [2] P. Valente and M. Andreolini, "Improving Application
          Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
          the 5th Annual International Systems and Storage Conference
          (SYSTOR '12), June 2012.
          Slightly extended version:
      Signed-off-by: default avatarFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarArianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  10. 14 Apr, 2017 1 commit
    • Omar Sandoval's avatar
      blk-mq: introduce Kyber multiqueue I/O scheduler · 00e04393
      Omar Sandoval authored
      The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
      scale to multiple queues. Users configure only two knobs, the target
      read and synchronous write latencies, and the scheduler tunes itself to
      achieve that latency goal.
      The implementation is based on "tokens", built on top of the scalable
      bitmap library. Tokens serve as a mechanism for limiting requests. There
      are two tiers of tokens: queueing tokens and dispatch tokens.
      A queueing token is required to allocate a request. In fact, these
      tokens are actually the blk-mq internal scheduler tags, but the
      scheduler manages the allocation directly in order to implement its
      Dispatch tokens are device-wide and split up into two scheduling
      domains: reads vs. writes. Each hardware queue dispatches batches
      round-robin between the scheduling domains as long as tokens are
      available for that domain.
      These tokens can be used as the mechanism to enable various policies.
      The policy Kyber uses is inspired by active queue management techniques
      for network routing, similar to blk-wbt. The scheduler monitors
      latencies and scales the number of dispatch tokens accordingly. Queueing
      tokens are used to prevent starvation of synchronous requests by
      asynchronous requests.
      Various extensions are possible, including better heuristics and ionice
      support. The new scheduler isn't set as the default yet.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  11. 08 Apr, 2017 1 commit
  12. 28 Mar, 2017 1 commit
    • Shaohua Li's avatar
      blk-throttle: make throtl_slice tunable · 297e3d85
      Shaohua Li authored
      throtl_slice is important for blk-throttling. It's called slice
      internally but it really is a time window blk-throttling samples data.
      blk-throttling will make decision based on the samplings. An example is
      bandwidth measurement. A cgroup's bandwidth is measured in the time
      interval of throtl_slice.
      A small throtl_slice meanse cgroups have smoother throughput but burn
      more CPUs. It has 100ms default value, which is not appropriate for all
      disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
      it tunable.
      Since throtl_slice isn't a time slice, the sysfs name
      'throttle_sample_time' reflects its character better.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  13. 26 Jan, 2017 1 commit
  14. 03 Jan, 2017 1 commit
  15. 28 Nov, 2016 1 commit
  16. 18 Nov, 2016 1 commit
  17. 16 Nov, 2016 1 commit
  18. 10 Nov, 2016 1 commit
    • Jens Axboe's avatar
      block: hook up writeback throttling · 87760e5e
      Jens Axboe authored
      Enable throttling of buffered writeback to make it a lot
      more smooth, and has way less impact on other system activity.
      Background writeback should be, by definition, background
      activity. The fact that we flush huge bundles of it at the time
      means that it potentially has heavy impacts on foreground workloads,
      which isn't ideal. We can't easily limit the sizes of writes that
      we do, since that would impact file system layout in the presence
      of delayed allocation. So just throttle back buffered writeback,
      unless someone is waiting for it.
      The algorithm for when to throttle takes its inspiration in the
      CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      Unlike CoDel, blk-wb allows the scale count to to negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-bw quickly snaps back to it's
      stable state of a zero scale count.
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to me met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  19. 01 Nov, 2016 1 commit
    • Christoph Hellwig's avatar
      block: replace REQ_NOIDLE with REQ_IDLE · a2b80967
      Christoph Hellwig authored
      Noidle should be the default for writes as seen by all the compounds
      definitions in fs.h using it.  In fact only direct I/O really should
      be using NODILE, so turn the whole flag around to get the defaults
      right, which will make our life much easier especially onces the
      WRITE_* defines go away.
      This assumes all the existing "raw" users of REQ_SYNC for writes
      want noidle behavior, which seems to be spot on from a quick audit.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  20. 28 Oct, 2016 2 commits
    • Christoph Hellwig's avatar
      block: better op and flags encoding · ef295ecf
      Christoph Hellwig authored
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      In addition this allows passing around only one value in the block layer
      instead of two (and eventuall also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Christoph Hellwig's avatar
      block: split out request-only flags into a new namespace · e8064021
      Christoph Hellwig authored
      A lot of the REQ_* flags are only used on struct requests, and only of
      use to the block layer and a few drivers that dig into struct request
      This patch adds a new req_flags_t rq_flags field to struct request for
      them, and thus dramatically shrinks the number of common requests.  It
      also removes the unfortunate situation where we have to fit the fields
      from the same enum into 32 bits for struct bio and 64 bits for
      struct request.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarShaun Tancheff <shaun.tancheff@seagate.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  21. 14 Sep, 2016 1 commit
  22. 11 Aug, 2016 1 commit
  23. 07 Aug, 2016 1 commit
    • Jens Axboe's avatar
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe authored
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      No intended functional changes in this commit.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  24. 01 Jul, 2016 1 commit
  25. 28 Jun, 2016 1 commit
  26. 07 Jun, 2016 2 commits
  27. 12 Apr, 2016 2 commits
  28. 31 Mar, 2016 1 commit
  29. 10 Dec, 2015 1 commit