1. 07 Dec, 2018 1 commit
    • Paolo Valente's avatar
      block, bfq: fix decrement of num_active_groups · ba7aeae5
      Paolo Valente authored
      Since commit '2d29c9f8 ("block, bfq: improve asymmetric scenarios
      detection")', if there are process groups with I/O requests waiting for
      completion, then BFQ tags the scenario as 'asymmetric'. This detection
      is needed for preserving service guarantees (for details, see comments
      on the computation * of the variable asymmetric_scenario in the
      function bfq_better_to_idle).
      
      Unfortunately, commit '2d29c9f8 ("block, bfq: improve asymmetric
      scenarios detection")' contains an error exactly in the updating of
      the number of groups with I/O requests waiting for completion: if a
      group has more than one descendant process, then the above number of
      groups, which is renamed from num_active_groups to a more appropriate
      num_groups_with_pending_reqs by this commit, may happen to be wrongly
      decremented multiple times, namely every time one of the descendant
      processes gets all its pending I/O requests completed.
      
      A correct, complete solution should work as follows. Consider a group
      that is inactive, i.e., that has no descendant process with pending
      I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
      is still accounting for this group, because the group still has some
      descendant process with some I/O request still in
      flight. num_groups_with_pending_reqs should be decremented when the
      in-flight request of the last descendant process is finally completed
      (assuming that nothing else has changed for the group in the meantime,
      in terms of composition of the group and active/inactive state of
      child groups and processes). To accomplish this, an additional
      pending-request counter must be added to entities, and must be
      updated correctly.
      
      To avoid this additional field and operations, this commit resorts to
      the following tradeoff between simplicity and accuracy: for an
      inactive group that is still counted in num_groups_with_pending_reqs,
      this commit decrements num_groups_with_pending_reqs when the first
      descendant process of the group remains with no request waiting for
      completion.
      
      This simplified scheme provides a fix to the unbalanced decrements
      introduced by 2d29c9f8. Since this error was also caused by lack
      of comments on this non-trivial issue, this commit also adds related
      comments.
      
      Fixes: 2d29c9f8 ("block, bfq: improve asymmetric scenarios detection")
      Reported-by: 's avatarSteven Barrett <steven@liquorix.net>
      Tested-by: 's avatarSteven Barrett <steven@liquorix.net>
      Tested-by: 's avatarLucjan Lucjanov <lucjan.lucjanov@gmail.com>
      Reviewed-by: 's avatarFederico Motta <federico@willer.it>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      ba7aeae5
  2. 25 Oct, 2018 1 commit
    • Federico Motta's avatar
      block, bfq: fix asymmetric scenarios detection · 98fa7a3e
      Federico Motta authored
      Since commit 2d29c9f8 ("block, bfq: improve asymmetric scenarios
      detection"), a scenario is defined asymmetric when one of the
      following conditions holds:
      - active bfq_queues have different weights
      - one or more group of entities (bfq_queue or other groups of entities)
        are active
      bfq grants fairness and low latency also in such asymmetric scenarios,
      by plugging the dispatching of I/O if the bfq_queue in service happens
      to be temporarily idle. This plugging may lower throughput, so it is
      important to do it only when strictly needed.
      
      By mistake, in commit '2d29c9f8' ("block, bfq: improve asymmetric
      scenarios detection") the num_active_groups counter was firstly
      incremented and subsequently decremented at any entity (group or
      bfq_queue) weight change.
      
      This is useless, because only transitions from active to inactive and
      vice versa matter for that counter. Unfortunately this is also
      incorrect in the following case: the entity at issue is a bfq_queue
      and it is under weight raising. In fact in this case there is a
      spurious increment of the num_active_groups counter.
      
      This spurious increment may cause scenarios to be wrongly detected as
      asymmetric, thus causing useless plugging and loss of throughput.
      
      This commit fixes this issue by simply removing the above useless and
      wrong increments and decrements.
      
      Fixes: 2d29c9f8 ("block, bfq: improve asymmetric scenarios detection")
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: 's avatarFederico Motta <federico@willer.it>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      98fa7a3e
  3. 13 Oct, 2018 1 commit
    • Federico Motta's avatar
      block, bfq: improve asymmetric scenarios detection · 2d29c9f8
      Federico Motta authored
      bfq defines as asymmetric a scenario where an active entity, say E
      (representing either a single bfq_queue or a group of other entities),
      has a higher weight than some other entities.  If the entity E does sync
      I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
      other entities in the following situation: E is in service but
      temporarily has no pending I/O request.  In fact, without this plugging,
      all the times that E stops being temporarily idle, it may find the
      internal queues of the storage device already filled with an
      out-of-control number of extra requests, from other entities. So E may
      have to wait for the service of these extra requests, before finally
      having its own requests served. This may easily break service
      guarantees, with E getting less than its fair share of the device
      throughput.  Usually, the end result is that E gets the same fraction of
      the throughput as the other entities, instead of getting more, according
      to its higher weight.
      
      Yet there are two other more subtle cases where E, even if its weight is
      actually equal to or even lower than the weight of any other active
      entities, may get less than its fair share of the throughput in case the
      above I/O plugging is not performed:
      1. other entities issue larger requests than E;
      2. other entities contain more active child entities than E (or in
         general tend to have more backlog than E).
      
      In the first case, other entities may get more service than E because
      they get larger requests, than those of E, served during the temporary
      idle periods of E.  In the second case, other entities get more service
      because, by having many child entities, they have many requests ready
      for dispatching while E is temporarily idle.
      
      This commit addresses this issue by extending the definition of
      asymmetric scenario: a scenario is asymmetric when
      - active entities representing bfq_queues have differentiated weights,
        as in the original definition
      or (inclusive)
      - one or more entities representing groups of entities are active.
      
      This broader definition makes sure that I/O plugging will be performed
      in all the above cases, provided that there is at least one active
      group.  Of course, this definition is very coarse, so it will trigger
      I/O plugging also in cases where it is not needed, such as, e.g.,
      multiple active entities with just one child each, and all with the same
      I/O-request size.  The reason for this coarse definition is just that a
      finer-grained definition would be rather heavy to compute.
      
      On the opposite end, even this new definition does not trigger I/O
      plugging in all cases where there is no active group, and all bfq_queues
      have the same weight.  So, in these cases some unfairness may occur if
      there are asymmetries in I/O-request sizes.  We made this choice because
      I/O plugging may lower throughput, and probably a user that has not
      created any group cares more about throughput than about perfect
      fairness.  At any rate, as for possible applications that may care about
      service guarantees, bfq already guarantees a high responsiveness and a
      low latency to soft real-time applications automatically.
      Signed-off-by: 's avatarFederico Motta <federico@willer.it>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      2d29c9f8
  4. 14 Sep, 2018 1 commit
    • Paolo Valente's avatar
      block, bfq: correctly charge and reset entity service in all cases · cbeb869a
      Paolo Valente authored
      BFQ schedules entities (which represent either per-process queues or
      groups of queues) as a function of their timestamps. In particular, as
      a function of their (virtual) finish times. The finish time of an
      entity is computed as a function of the budget assigned to the entity,
      assuming, tentatively, that the entity, once in service, will receive
      an amount of service equal to its budget. Then, when the entity is
      expired because it finishes to be served, this finish time is updated
      as a function of the actual service received by the entity. This
      allows the entity to be correctly charged with only the service
      received, and then to be correctly re-scheduled.
      
      Yet an entity may receive service also while not being the entity in
      service (in the scheduling environment of its parent entity), for
      several reasons. If the entity remains with no backlog while receiving
      this 'unofficial' service, then it is expired. Also on such an
      expiration, the finish time of the entity should be updated to account
      for only the service actually received by the entity. Unfortunately,
      such an update is not performed for an entity expiring without being
      the entity in service.
      
      In a similar vein, the service counter of the entity in service is
      reset when the entity is expired, to be ready to be used for next
      service cycle. This reset too should be performed also in case an
      entity is expired because it remains empty after receiving service
      while not being the entity in service. But in this case the reset is
      not performed.
      
      This commit performs the above update of the finish time and reset of
      the service received, also for an entity expiring while not being the
      entity in service.
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      cbeb869a
  5. 16 Aug, 2018 2 commits
  6. 09 Jul, 2018 2 commits
    • Paolo Valente's avatar
      block, bfq: fix service being wrongly set to zero in case of preemption · 9fae8dd5
      Paolo Valente authored
      If
      - a bfq_queue Q preempts another queue, because one request of Q
      arrives in time,
      - but, after this preemption, Q is not the queue that is set in service,
      then Q->entity.service is set to 0 when Q is eventually set in
      service. But Q should have continued receiving service with its old
      budget (which is why preemption has occurred) and its old service.
      
      This commit addresses this issue by resetting service on queue real
      expiration.
      Tested-by: 's avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      9fae8dd5
    • Paolo Valente's avatar
      block, bfq: add/remove entity weights correctly · 0471559c
      Paolo Valente authored
      To keep I/O throughput high as often as possible, BFQ performs
      I/O-dispatch plugging (aka device idling) only when beneficial exactly
      for throughput, or when needed for service guarantees (low latency,
      fairness). An important case where the latter condition holds is when
      the scenario is 'asymmetric' in terms of weights: i.e., when some
      bfq_queue or whole group of queues has a higher weight, and thus has
      to receive more service, than other queues or groups. Without dispatch
      plugging, lower-weight queues/groups may unjustly steal bandwidth to
      higher-weight queues/groups.
      
      To detect asymmetric scenarios, BFQ checks some sufficient
      conditions. One of these conditions is that active groups have
      different weights. BFQ controls this condition by maintaining a
      special set of unique weights of active groups
      (group_weights_tree). To this purpose, in the function
      bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
      group to/from this set.
      
      Unfortunately, the function bfq_active_extract may happen to be
      invoked also for a group that is still active (to preserve the correct
      update of the next queue to serve, see comments in function
      bfq_no_longer_next_in_service() for details). In this case, removing
      the weight of the group makes the set group_weights_tree
      inconsistent. Service-guarantee violations follow.
      
      This commit addresses this issue by moving group_weights_tree
      insertions from their previous location (in bfq_active_insert) into
      the function __bfq_activate_entity, and by moving group_weights_tree
      extractions from bfq_active_extract to when the entity that represents
      a group remains throughly idle, i.e., with no request either enqueued
      or dispatched.
      Tested-by: 's avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      0471559c
  7. 18 Jan, 2018 1 commit
    • Paolo Valente's avatar
      block, bfq: limit sectors served with interactive weight raising · 8a8747dc
      Paolo Valente authored
      To maximise responsiveness, BFQ raises the weight, and performs device
      idling, for bfq_queues associated with processes deemed as
      interactive. In particular, weight raising has a maximum duration,
      equal to the time needed to start a large application. If a
      weight-raised process goes on doing I/O beyond this maximum duration,
      it loses weight-raising.
      
      This mechanism is evidently vulnerable to the following false
      positives: I/O-bound applications that will go on doing I/O for much
      longer than the duration of weight-raising. These applications have
      basically no benefit from being weight-raised at the beginning of
      their I/O. On the opposite end, while being weight-raised, these
      applications
      a) unjustly steal throughput to applications that may truly need
      low latency;
      b) make BFQ uselessly perform device idling; device idling results
      in loss of device throughput with most flash-based storage, and may
      increase latencies when used purposelessly.
      
      This commit adds a countermeasure to reduce both the above
      problems. To introduce this countermeasure, we provide the following
      extra piece of information (full details in the comments added by this
      commit). During the start-up of the large application used as a
      reference to set the duration of weight-raising, involved processes
      transfer at most ~110K sectors each. Accordingly, a process initially
      deemed as interactive has no right to be weight-raised any longer,
      once transferred 110K sectors or more.
      
      Basing on this consideration, this commit early-ends weight-raising
      for a bfq_queue if the latter happens to have received an amount of
      service at least equal to 110K sectors (actually, a little bit more,
      to keep a safety margin). I/O-bound applications that reach a high
      throughput, such as file copy, get to this threshold much before the
      allowed weight-raising period finishes. Thus this early ending of
      weight-raising reduces the amount of time during which these
      applications cause the problems described above.
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: 's avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      8a8747dc
  8. 05 Jan, 2018 1 commit
    • Paolo Valente's avatar
      block, bfq: let a queue be merged only shortly after starting I/O · 7b8fa3b9
      Paolo Valente authored
      In BFQ and CFQ, two processes are said to be cooperating if they do
      I/O in such a way that the union of their I/O requests yields a
      sequential I/O pattern. To get such a sequential I/O pattern out of
      the non-sequential pattern of each cooperating process, BFQ and CFQ
      merge the queues associated with these processes. In more detail,
      cooperating processes, and thus their associated queues, usually
      start, or restart, to do I/O shortly after each other. This is the
      case, e.g., for the I/O threads of KVM/QEMU and of the dump
      utility. Basing on this assumption, this commit allows a bfq_queue to
      be merged only during a short time interval (100ms) after it starts,
      or re-starts, to do I/O.  This filtering provides two important
      benefits.
      
      First, it greatly reduces the probability that two non-cooperating
      processes have their queues merged by mistake, if they just happen to
      do I/O close to each other for a short time interval. These spurious
      merges cause loss of service guarantees. A low-weight bfq_queue may
      unjustly get more than its expected share of the throughput: if such a
      low-weight queue is merged with a high-weight queue, then the I/O for
      the low-weight queue is served as if the queue had a high weight. This
      may damage other high-weight queues unexpectedly.  For instance,
      because of this issue, lxterminal occasionally took 7.5 seconds to
      start, instead of 6.5 seconds, when some sequential readers and
      writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
      reason is that the bfq_queues associated with some of the readers or
      the writers were merged with the high-weight queues of some processes
      that had to do some urgent but little I/O. The readers then exploited
      the inherited high weight for all or most of their I/O, during the
      start-up of terminal. The filtering introduced by this commit
      eliminated any outlier caused by spurious queue merges in our start-up
      time tests.
      
      This filtering also provides a little boost of the throughput
      sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
      once a bfq_queue cannot be merged any longer, this commit makes BFQ
      stop updating the data needed to handle merging for the queue.
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      7b8fa3b9
  9. 15 Nov, 2017 1 commit
    • Paolo Valente's avatar
      block, bfq: update blkio stats outside the scheduler lock · 24bfd19b
      Paolo Valente authored
      bfq invokes various blkg_*stats_* functions to update the statistics
      contained in the special files blkio.bfq.* in the blkio controller
      groups, i.e., the I/O accounting related to the proportional-share
      policy provided by bfq. The execution of these functions takes a
      considerable percentage, about 40%, of the total per-request execution
      time of bfq (i.e., of the sum of the execution time of all the bfq
      functions that have to be executed to process an I/O request from its
      creation to its destruction).  This reduces the request-processing
      rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
      the bfq functions that invoke blkg_*stats_* functions cannot be
      executed in parallel with the rest of the code of bfq, because both
      are executed under the same same per-device scheduler lock.
      
      To reduce this slowdown, this commit moves, wherever possible, the
      invocation of these functions (more precisely, of the bfq functions
      that invoke blkg_*stats_* functions) outside the critical sections
      protected by the scheduler lock.
      
      With this change, and with all blkio.bfq.* statistics enabled, the
      throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
      i7-4850HQ, in case of 8 threads doing random I/O in parallel on
      null_blk, with the latter configured with 0 latency. We obtained the
      same or higher throughput boosts, up to +30%, with other processors
      (some figures are reported in the documentation). For our tests, we
      used the script [1], with which our results can be easily reproduced.
      
      NOTE. This commit still protects the invocation of blkg_*stats_*
      functions with the request_queue lock, because the group these
      functions are invoked on may otherwise disappear before or while these
      functions are executed.  Fortunately, tests without even this lock
      show, by difference, that the serialization caused by this lock has a
      little impact (at most ~5% of throughput reduction).
      
      [1] https://github.com/Algodev-github/IOSpeedTested-by: 's avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarLuca Miccio <lucmiccio@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      24bfd19b
  10. 31 Aug, 2017 3 commits
    • Paolo Valente's avatar
      block, bfq: guarantee update_next_in_service always returns an eligible entity · 24d90bb2
      Paolo Valente authored
      If the function bfq_update_next_in_service is invoked as a consequence
      of the activation or requeueing of an entity, say E, then it doesn't
      invoke bfq_lookup_next_entity to get the next-in-service entity. In
      contrast, it follows a shorter path: if E happens to be eligible (see
      commit "bfq-sq-mq: make lookup_next_entity push up vtime on
      expirations" for details on eligibility) and to have a lower virtual
      finish time than the current candidate as next-in-service entity, then
      E directly becomes the next-in-service entity. Unfortunately, there is
      a corner case for which this shorter path makes
      bfq_update_next_in_service choose a non eligible entity: it occurs if
      both E and the current next-in-service entity happen to be non
      eligible when bfq_update_next_in_service is invoked. In this case, E
      is not set as next-in-service, and, since bfq_lookup_next_entity is
      not invoked, the state of the parent entity is not updated so as to
      end up with an eligible entity as the proper next-in-service entity.
      
      In this respect, next-in-service is actually allowed to be non
      eligible while some queue is in service: since no system-virtual-time
      push-up can be performed in that case (see again commit "bfq-sq-mq:
      make lookup_next_entity push up vtime on expirations" for details),
      next-in-service is chosen, speculatively, as a function of the
      possible value that the system virtual time may get after a push
      up. But the correctness of the schedule breaks if next-in-service is
      still a non eligible entity when it is time to set in service the next
      entity. Unfortunately, this may happen in the above corner case.
      
      This commit fixes this problem by making bfq_update_next_in_service
      invoke bfq_lookup_next_entity not only if the above shorter path
      cannot be taken, but also if the shorter path is taken but fails to
      yield an eligible next-in-service entity.
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: 's avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      24d90bb2
    • Paolo Valente's avatar
      block, bfq: remove direct switch to an entity in higher class · a02195ce
      Paolo Valente authored
      If the function bfq_update_next_in_service is invoked as a consequence
      of the activation or requeueing of an entity, say E, and finds out
      that E belongs to a higher-priority class than that of the current
      next-in-service entity, then it sets next_in_service directly to
      E. But this may lead to anomalous schedules, because E may happen not
      be eligible for service, because its virtual start time is higher than
      the system virtual time for its service tree.
      
      This commit addresses this issue by simply removing this direct
      switch.
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: 's avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      a02195ce
    • Paolo Valente's avatar
      block, bfq: make lookup_next_entity push up vtime on expirations · 80294c3b
      Paolo Valente authored
      To provide a very smooth service, bfq starts to serve a bfq_queue
      only if the queue is 'eligible', i.e., if the same queue would
      have started to be served in the ideal, perfectly fair system that
      bfq simulates internally. This is obtained by associating each
      queue with a virtual start time, and by computing a special system
      virtual time quantity: a queue is eligible only if the system
      virtual time has reached the virtual start time of the
      queue. Finally, bfq guarantees that, when a new queue must be set
      in service, there is always at least one eligible entity for each
      active parent entity in the scheduler. To provide this guarantee,
      the function __bfq_lookup_next_entity pushes up, for each parent
      entity on which it is invoked, the system virtual time to the
      minimum among the virtual start times of the entities in the
      active tree for the parent entity (more precisely, the push up
      occurs if the system virtual time happens to be lower than all
      such virtual start times).
      
      There is however a circumstance in which __bfq_lookup_next_entity
      cannot push up the system virtual time for a parent entity, even
      if the system virtual time is lower than the virtual start times
      of all the child entities in the active tree. It happens if one of
      the child entities is in service. In fact, in such a case, there
      is already an eligible entity, the in-service one, even if it may
      not be not present in the active tree (because in-service entities
      may be removed from the active tree).
      
      Unfortunately, in the last re-design of the
      hierarchical-scheduling engine, the reset of the pointer to the
      in-service entity for a given parent entity--reset to be done as a
      consequence of the expiration of the in-service entity--always
      happens after the function __bfq_lookup_next_entity has been
      invoked. This causes the function to think that there is still an
      entity in service for the parent entity, and then that the system
      virtual time cannot be pushed up, even if actually such a
      no-more-in-service entity has already been properly reinserted
      into the active tree (or in some other tree if no more
      active). Yet, the system virtual time *had* to be pushed up, to be
      ready to correctly choose the next queue to serve. Because of the
      lack of this push up, bfq may wrongly set in service a queue that
      had been speculatively pre-computed as the possible
      next-in-service queue, but that would no more be the one to serve
      after the expiration and the reinsertion into the active trees of
      the previously in-service entities.
      
      This commit addresses this issue by making
      __bfq_lookup_next_entity properly push up the system virtual time
      if an expiration is occurring.
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: 's avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: 's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      80294c3b
  11. 29 Jul, 2017 2 commits
    • Paolo Valente's avatar
      block, bfq: consider also in_service_entity to state whether an entity is active · 46d556e6
      Paolo Valente authored
      Groups of BFQ queues are represented by generic entities in BFQ. When
      a queue belonging to a parent entity is deactivated, the parent entity
      may need to be deactivated too, in case the deactivated queue was the
      only active queue for the parent entity. This deactivation may need to
      be propagated upwards if the entity belongs, in its turn, to a further
      higher-level entity, and so on. In particular, the upward propagation
      of deactivation stops at the first parent entity that remains active
      even if one of its child entities has been deactivated.
      
      To decide whether the last non-deactivation condition holds for a
      parent entity, BFQ checks whether the field next_in_service is still
      not NULL for the parent entity, after the deactivation of one of its
      child entity. If it is not NULL, then there are certainly other active
      entities in the parent entity, and deactivations can stop.
      
      Unfortunately, this check misses a corner case: if in_service_entity
      is not NULL, then next_in_service may happen to be NULL, although the
      parent entity is evidently active. This happens if: 1) the entity
      pointed by in_service_entity is the only active entity in the parent
      entity, and 2) according to the definition of next_in_service, the
      in_service_entity cannot be considered as next_in_service. See the
      comments on the definition of next_in_service for details on this
      second point.
      
      Hitting the above corner case causes crashes.
      
      To address this issue, this commit:
      1) Extends the above check on only next_in_service to controlling both
      next_in_service and in_service_entity (if any of them is not NULL,
      then no further deactivation is performed)
      2) Improves the (important) comments on how next_in_service is defined
      and updated; in particular it fixes a few rather obscure paragraphs
      Reported-by: 's avatarEric Wheeler <bfq-sched@lists.ewheeler.net>
      Reported-by: 's avatarRick Yiu <rick_yiu@htc.com>
      Reported-by: 's avatarTom X Nguyen <tom81094@gmail.com>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: 's avatarEric Wheeler <bfq-sched@lists.ewheeler.net>
      Tested-by: 's avatarRick Yiu <rick_yiu@htc.com>
      Tested-by: 's avatarLaurentiu Nicola <lnicola@dend.ro>
      Tested-by: 's avatarTom X Nguyen <tom81094@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      46d556e6
    • Paolo Valente's avatar
      block, bfq: reset in_service_entity if it becomes idle · 6ab1d8da
      Paolo Valente authored
      BFQ implements hierarchical scheduling by representing each group of
      queues with a generic parent entity. For each parent entity, BFQ
      maintains an in_service_entity pointer: if one of the child entities
      happens to be in service, in_service_entity points to it.  The
      resetting of these pointers happens only on queue expirations: when
      the in-service queue is expired, i.e., stops to be the queue in
      service, BFQ resets all in_service_entity pointers along the
      parent-entity path from this queue to the root entity.
      
      Functions handling the scheduling of entities assume, naturally, that
      in-service entities are active, i.e., have pending I/O requests (or,
      as a special case, even if they have no pending requests, they are
      expected to receive a new request very soon, with the scheduler idling
      the storage device while waiting for such an event). Unfortunately,
      the above resetting scheme of the in_service_entity pointers may cause
      this assumption to be violated.  For example, the in-service queue may
      happen to remain without requests because of a request merge. In this
      case the queue does become idle, and all related data structures are
      updated accordingly. But in_service_entity still points to the queue
      in the parent entity. This inconsistency may even propagate to
      higher-level parent entities, if they happen to become idle as well,
      as a consequence of the leaf queue becoming idle. For this queue and
      parent entities, scheduling functions have an undefined behaviour,
      and, as reported, may easily lead to kernel crashes or hangs.
      
      This commit addresses this issue by simply resetting the
      in_service_entity field also when it is detected to point to an entity
      becoming idle (regardless of why the entity becomes idle).
      Reported-by: 's avatarLaurentiu Nicola <lnicola@dend.ro>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: 's avatarLaurentiu Nicola <lnicola@dend.ro>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      6ab1d8da
  12. 12 Jul, 2017 1 commit
  13. 03 Jul, 2017 1 commit
    • Paolo Valente's avatar
      block, bfq: don't change ioprio class for a bfq_queue on a service tree · 431b17f9
      Paolo Valente authored
      On each deactivation or re-scheduling (after being served) of a
      bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
      to perform pending updates of ioprio, weight and ioprio class for the
      bfq_queue. BFQ also invokes this function on I/O-request dispatches,
      to raise or lower weights more quickly when needed, thereby improving
      latency. However, the entity representing the bfq_queue may be on the
      active (sub)tree of a service tree when this happens, and, although
      with a very low probability, the bfq_queue may happen to also have a
      pending change of its ioprio class. If both conditions hold when
      __bfq_entity_update_weight_prio() is invoked, then the entity moves to
      a sort of hybrid state: the new service tree for the entity, as
      returned by bfq_entity_service_tree(), differs from service tree on
      which the entity still is. The functions that handle activations and
      deactivations of entities do not cope with such a hybrid state (and
      would need to become more complex to cope).
      
      This commit addresses this issue by just making
      __bfq_entity_update_weight_prio() not perform also a possible pending
      change of ioprio class, when invoked on an I/O-request dispatch for a
      bfq_queue. Such a change is thus postponed to when
      __bfq_entity_update_weight_prio() is invoked on deactivation or
      re-scheduling of the bfq_queue.
      Reported-by: 's avatarMarco Piazza <mpiazza@gmail.com>
      Reported-by: 's avatarLaurentiu Nicola <lnicola@dend.ro>
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: 's avatarMarco Piazza <mpiazza@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      431b17f9
  14. 10 May, 2017 1 commit
  15. 19 Apr, 2017 1 commit
    • Paolo Valente's avatar
      block, bfq: split bfq-iosched.c into multiple source files · ea25da48
      Paolo Valente authored
      The BFQ I/O scheduler features an optimal fair-queuing
      (proportional-share) scheduling algorithm, enriched with several
      mechanisms to boost throughput and reduce latency for interactive and
      real-time applications. This makes BFQ a large and complex piece of
      code. This commit addresses this issue by splitting BFQ into three
      main, independent components, and by moving each component into a
      separate source file:
      1. Main algorithm: handles the interaction with the kernel, and
      decides which requests to dispatch; it uses the following two further
      components to achieve its goals.
      2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
      computes the schedule, using weights and budgets provided by the above
      component.
      3. cgroups support: handles group operations (creation, destruction,
      move, ...).
      Signed-off-by: 's avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      ea25da48