1. 19 Apr, 2018 2 commits
  2. 18 Apr, 2018 1 commit
  3. 16 Mar, 2018 1 commit
    • Joseph Qi's avatar
      blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir() · 4c699480
      Joseph Qi authored
      We've triggered a WARNING in blk_throtl_bio() when throttling writeback
      io, which complains blkg->refcnt is already 0 when calling blkg_get(),
      and then kernel crashes with invalid page request.
      After investigating this issue, we've found it is caused by a race
      between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
      below:
      
      writeback kworker               cgroup_rmdir
                                        cgroup_destroy_locked
                                          kill_css
                                            css_killed_ref_fn
                                              css_killed_work_fn
                                                offline_css
                                                  blkcg_css_offline
        blkcg_bio_issue_check
          rcu_read_lock
          blkg_lookup
                                                    spin_trylock(q->queue_lock)
                                                    blkg_destroy
                                                    spin_unlock(q->queue_lock)
          blk_throtl_bio
          spin_lock_irq(q->queue_lock)
          ...
          spin_unlock_irq(q->queue_lock)
        rcu_read_unlock
      
      Since rcu can only prevent blkg from releasing when it is being used,
      the blkg->refcnt can be decreased to 0 during blkg_destroy() and schedule
      blkg release.
      Then trying to blkg_get() in blk_throtl_bio() will complains the WARNING.
      And then the corresponding blkg_put() will schedule blkg release again,
      which result in double free.
      This race is introduced by commit ae118896 ("blkcg: consolidate blkg
      creation in blkcg_bio_issue_check()"). Before this commit, it will
      lookup first and then try to lookup/create again with queue_lock. Since
      revive this logic is a bit drastic, so fix it by only offlining pd during
      blkcg_css_offline(), and move the rest destruction (especially
      blkg_put()) into blkcg_css_free(), which should be the right way as
      discussed.
      
      Fixes: ae118896 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
      Reported-by: 's avatarJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: 's avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      4c699480
  4. 26 Feb, 2018 1 commit
  5. 04 Nov, 2017 1 commit
  6. 10 Oct, 2017 1 commit
  7. 25 Aug, 2017 1 commit
  8. 01 Jun, 2017 1 commit
    • Bart Van Assche's avatar
      block: Avoid that blk_exit_rl() triggers a use-after-free · b425e504
      Bart Van Assche authored
      Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
      essential that the memory allocated for struct request_queue
      stays around until all blk_exit_rl() calls have finished. Hence
      make blk_init_rl() take a reference on struct request_queue.
      
      This patch fixes the following crash:
      
      general protection fault: 0000 [#2] SMP
      CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G      D         4.12.0-rc2-dbg+ #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      task: ffff88013a108040 task.stack: ffffc9000071c000
      RIP: 0010:free_request_size+0x1a/0x30
      RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
      RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
      RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
      RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
      R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
      R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
      FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
      Call Trace:
       mempool_destroy.part.10+0x21/0x40
       mempool_destroy+0xe/0x10
       blk_exit_rl+0x12/0x20
       blkg_free+0x4d/0xa0
       __blkg_release_rcu+0x59/0x170
       rcu_process_callbacks+0x260/0x4e0
       __do_softirq+0x116/0x250
       smpboot_thread_fn+0x123/0x1e0
       kthread+0x109/0x140
       ret_from_fork+0x31/0x40
      
      Fixes: commit e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request")
      Signed-off-by: 's avatarBart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Reviewed-by: 's avatarHannes Reinecke <hare@suse.com>
      Reviewed-by: 's avatarChristoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org> # v4.11+
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      b425e504
  9. 29 Mar, 2017 2 commits
    • Tahsin Erdogan's avatar
      blkcg: allocate struct blkcg_gq outside request queue spinlock · 457e490f
      Tahsin Erdogan authored
      blkg_conf_prep() currently calls blkg_lookup_create() while holding
      request queue spinlock. This means allocating memory for struct
      blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
      failures in call paths like below:
      
        pcpu_alloc+0x68f/0x710
        __alloc_percpu_gfp+0xd/0x10
        __percpu_counter_init+0x55/0xc0
        cfq_pd_alloc+0x3b2/0x4e0
        blkg_alloc+0x187/0x230
        blkg_create+0x489/0x670
        blkg_lookup_create+0x9a/0x230
        blkg_conf_prep+0x1fb/0x240
        __cfqg_set_weight_device.isra.105+0x5c/0x180
        cfq_set_weight_on_dfl+0x69/0xc0
        cgroup_file_write+0x39/0x1c0
        kernfs_fop_write+0x13f/0x1d0
        __vfs_write+0x23/0x120
        vfs_write+0xc2/0x1f0
        SyS_write+0x44/0xb0
        entry_SYSCALL_64_fastpath+0x18/0xad
      
      In the code path above, percpu allocator cannot call vmalloc() due to
      queue spinlock.
      
      A failure in this call path gives grief to tools which are trying to
      configure io weights. We see occasional failures happen shortly after
      reboots even when system is not under any memory pressure. Machines
      with a lot of cpus are more vulnerable to this condition.
      
      Do struct blkcg_gq allocations outside the queue spinlock to allow
      blocking during memory allocations.
      Suggested-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarTahsin Erdogan <tahsin@google.com>
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      457e490f
    • Jens Axboe's avatar
      Revert "blkcg: allocate struct blkcg_gq outside request queue spinlock" · d708f0d5
      Jens Axboe authored
      I inadvertently applied the v5 version of this patch, whereas
      the agreed upon version was v5. Revert this one so we can apply
      the right one.
      
      This reverts commit 7fc6b87a.
      d708f0d5
  10. 28 Mar, 2017 1 commit
    • Tahsin Erdogan's avatar
      blkcg: allocate struct blkcg_gq outside request queue spinlock · 7fc6b87a
      Tahsin Erdogan authored
      blkg_conf_prep() currently calls blkg_lookup_create() while holding
      request queue spinlock. This means allocating memory for struct
      blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
      failures in call paths like below:
      
        pcpu_alloc+0x68f/0x710
        __alloc_percpu_gfp+0xd/0x10
        __percpu_counter_init+0x55/0xc0
        cfq_pd_alloc+0x3b2/0x4e0
        blkg_alloc+0x187/0x230
        blkg_create+0x489/0x670
        blkg_lookup_create+0x9a/0x230
        blkg_conf_prep+0x1fb/0x240
        __cfqg_set_weight_device.isra.105+0x5c/0x180
        cfq_set_weight_on_dfl+0x69/0xc0
        cgroup_file_write+0x39/0x1c0
        kernfs_fop_write+0x13f/0x1d0
        __vfs_write+0x23/0x120
        vfs_write+0xc2/0x1f0
        SyS_write+0x44/0xb0
        entry_SYSCALL_64_fastpath+0x18/0xad
      
      In the code path above, percpu allocator cannot call vmalloc() due to
      queue spinlock.
      
      A failure in this call path gives grief to tools which are trying to
      configure io weights. We see occasional failures happen shortly after
      reboots even when system is not under any memory pressure. Machines
      with a lot of cpus are more vulnerable to this condition.
      
      Update blkg_create() function to temporarily drop the rcu and queue
      locks when it is allowed by gfp mask.
      Suggested-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarTahsin Erdogan <tahsin@google.com>
      Acked-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      7fc6b87a
  11. 02 Mar, 2017 1 commit
  12. 03 Feb, 2017 1 commit
  13. 02 Feb, 2017 1 commit
  14. 18 Jan, 2017 2 commits
  15. 17 Jan, 2017 1 commit
  16. 22 Nov, 2016 1 commit
  17. 30 Sep, 2016 1 commit
  18. 14 Jun, 2016 1 commit
  19. 09 Feb, 2016 1 commit
    • Roman Pen's avatar
      block: fix module reference leak on put_disk() call for cgroups throttle · 39a169b6
      Roman Pen authored
      get_disk(),get_gendisk() calls have non explicit side effect: they
      increase the reference on the disk owner module.
      
      The following is the correct sequence how to get a disk reference and
      to put it:
      
          disk = get_gendisk(...);
      
          /* use disk */
      
          owner = disk->fops->owner;
          put_disk(disk);
          module_put(owner);
      
      fs/block_dev.c is aware of this required module_put() call, but f.e.
      blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
      a module reference.  To see a leakage in action cgroups throttle config
      can be used.  In the following script I'm removing throttle for /dev/ram0
      (actually this is NOP, because throttle was never set for this device):
      
          # lsmod | grep brd
          brd                     5175  0
          # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
              /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
          done
          # lsmod | grep brd
          brd                     5175  100
      
      Now brd module has 100 references.
      
      The issue is fixed by calling module_put() just right away put_disk().
      Signed-off-by: 's avatarRoman Pen <roman.penyaev@profitbricks.com>
      Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      39a169b6
  20. 03 Dec, 2015 1 commit
    • Tejun Heo's avatar
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo authored
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: 's avatarDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  21. 22 Oct, 2015 1 commit
    • Tejun Heo's avatar
      blkcg: don't create "io.stat" on the root cgroup · ca0752c5
      Tejun Heo authored
      The stat files on the root cgroup shows stats for the whole system and
      usually don't contain any information which isn't available through
      the usual system monitoring mechanisms.  Some controllers skip
      collecting these duplicate stats to optimize cases where cgroup isn't
      used and later try to emulate the result on demand.
      
      This leads to complexities and subtle differences in the information
      shown through different channels.  This is entirely unnecessary and
      cgroup v2 is dropping stat files which are duplicate from all
      controllers.  This patch removes "io.stat" from the root hierarchy.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Acked-by: 's avatarJens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      ca0752c5
  22. 11 Sep, 2015 1 commit
    • Tejun Heo's avatar
      block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg · 6fe810bd
      Tejun Heo authored
      While making the root blkg unconditional, ec13b1d6 ("blkcg: always
      create the blkcg_gq for the root blkcg") removed the part which clears
      q->root_blkg and ->root_rl.blkg during q exit.  This leaves the two
      pointers dangling after blkg_destroy_all().  blk-throttle exit path
      performs blkg traversals and dereferences ->root_blkg and can lead to
      the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
       IP: [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       ...
       task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
       RIP: 0010:[<ffffffff81389746>]  [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       ...
       Call Trace:
        [<ffffffff8138d14a>] blk_throtl_drain+0x5a/0x110
        [<ffffffff8138a108>] blkcg_drain_queue+0x18/0x20
        [<ffffffff81369a70>] __blk_drain_queue+0xc0/0x170
        [<ffffffff8136a101>] blk_queue_bypass_start+0x61/0x80
        [<ffffffff81388c59>] blkcg_deactivate_policy+0x39/0x100
        [<ffffffff8138d328>] blk_throtl_exit+0x38/0x50
        [<ffffffff8138a14e>] blkcg_exit_queue+0x3e/0x50
        [<ffffffff8137016e>] blk_release_queue+0x1e/0xc0
       ...
      
      While the bug is a straigh-forward use-after-free bug, it is tricky to
      reproduce because blkg release is RCU protected and the rest of exit
      path usually finishes before RCU grace period.
      
      This patch fixes the bug by updating blkg_destro_all() to clear
      q->root_blkg and ->root_rl.blkg.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Reported-by: 's avatar"Richard W.M. Jones" <rjones@redhat.com>
      Reported-by: 's avatarJosh Boyer <jwboyer@fedoraproject.org>
      Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
      Fixes: ec13b1d6 ("blkcg: always create the blkcg_gq for the root blkcg")
      Cc: stable@vger.kernel.org # v4.2+
      Tested-by: 's avatarRichard W.M. Jones <rjones@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      6fe810bd
  23. 18 Aug, 2015 15 commits
    • Tejun Heo's avatar
      blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy · 69d7fde5
      Tejun Heo authored
      cgroup is trying to make interface consistent across different
      controllers.  For weight based resource control, the knob should have
      the range [1, 10000] and default to 100.  This patch updates
      cfq-iosched so that the weight range conforms.  The internal
      calculations have enough range and the widening of the weight range
      shouldn't cause any problem.
      
      * blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
        when blkcg is attached to a hierarchy.
      
      * cfq_cpd_init() is updated to use the new default value on the
        unified hierarchy.
      
      * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
        apply the default config matching the hierarchy type.
      
      * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
        is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
        now responsible for initializing the initial weights when blkcg is
        enabled.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      69d7fde5
    • Tejun Heo's avatar
      blkcg: implement interface for the unified hierarchy · 2ee867dc
      Tejun Heo authored
      blkcg interface grew to be the biggest of all controllers and
      unfortunately most inconsistent too.  The interface files are
      inconsistent with a number of cloes duplicates.  Some files have
      recursive variants while others don't.  There's distinction between
      normal and leaf weights which isn't intuitive and there are a lot of
      stat knobs which don't make much sense outside of debugging and expose
      too much implementation details to userland.
      
      In the unified hierarchy, everything is always hierarchical and
      internal nodes can't have tasks rendering the two structural issues
      twisting the current interface.  The interface has to be updated in a
      significant anyway and this is a good chance to revamp it as a whole.
      This patch implements blkcg interface for the unified hierarchy.
      
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
        overhead.
      
      * The original interface consisted of 27 files is replaced with the
        following three files.
      
        blkio.stat	: per-blkcg stats
        blkio.weight	: per-cgroup and per-cgroup-queue weight settings
        blkio.max	: per-cgroup-queue bps and iops max limits
      
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      2ee867dc
    • Tejun Heo's avatar
      blkcg: misc preparations for unified hierarchy interface · dd165eb3
      Tejun Heo authored
      * Export blkg_dev_name()
      
      * Drop unnecessary @cft from __cfq_set_weight().
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      dd165eb3
    • Tejun Heo's avatar
      blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
      Tejun Heo authored
      Currently, blkg_conf_prep() expects input to be of the following form
      
       MAJ:MIN NUM
      
      and reads the NUM part into blkg_conf_ctx->v.  This is quite
      restrictive and gets in the way in implementing blkcg interface for
      the unified hierarchy.  This patch updates blkg_conf_prep() so that it
      expects
      
       MAJ:MIN BODY_STR
      
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      
      To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
      non-const pointer and to accommodate that const is dropped from @input
      too.
      
      This doesn't cause any behavior changes.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      36aa9e5f
    • Tejun Heo's avatar
      blkcg: mark existing cftypes as legacy · 880f50e2
      Tejun Heo authored
      blkcg is about to grow interface for the unified hierarchy.  Add
      legacy to existing cftypes.
      
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      
      Pure renames.  No functional change.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      880f50e2
    • Tejun Heo's avatar
      blkcg: rename subsystem name from blkio to io · c165b3e3
      Tejun Heo authored
      blkio interface has become messy over time and is currently the
      largest.  In addition to the inconsistent naming scheme, it has
      multiple stat files which report more or less the same thing, a number
      of debug stat files which expose internal details which shouldn't have
      been part of the public interface in the first place, recursive and
      non-recursive stats and leaf and non-leaf knobs.
      
      Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
      don't make any sense on the unified hierarchy as only leaf cgroups can
      contain processes.  cgroups is going through a major interface
      revision with the unified hierarchy involving significant fundamental
      usage changes and given that a significant portion of the interface
      doesn't make sense anymore, it's a good time to reorganize the
      interface.
      
      As the first step, this patch renames the external visible subsystem
      name from "blkio" to "io".  This is more concise, matches the other
      two major subsystem names, "cpu" and "memory", and better suited as
      blkcg will be involved in anything writeback related too whether an
      actual block device is involved or not.
      
      As the subsystem legacy_name is set to "blkio", the only userland
      visible change outside the unified hierarchy is that blkcg is reported
      as "io" instead of "blkio" in the subsystem initialized message during
      boot.  On the unified hierarchy, blkcg now appears as "io".
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      c165b3e3
    • Tejun Heo's avatar
      blkcg: refine error codes returned during blkcg configuration · 20386ce0
      Tejun Heo authored
      blkcg currently returns -EINVAL for most errors which can be pretty
      confusing given that the failure modes are quite varied.  Update the
      error returns so that
      
      * -EINVAL only for syntactic errors.
      * -ERANGE if the value is out of range.
      * -ENODEV if the target device can't be found.
      * -EOPNOTSUPP if the policy is not enabled on the target device.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      20386ce0
    • Tejun Heo's avatar
      blkcg: reduce stack usage of blkg_rwstat_recursive_sum() · 3a7faead
      Tejun Heo authored
      The recent percpu conversion of blkg_rwstat triggered the following
      warning in certain configurations.
      
       block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes
      
      This is because blkg_rwstat now contains four percpu_counter which can
      be pretty big depending on debug options although it shouldn't be a
      problem in production configs.  This patch removes one of the two
      local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
      reduce stack usage.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Reported-by: 's avatarkbuild test robot <fengguang.wu@intel.com>
      Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      3a7faead
    • Tejun Heo's avatar
      blkcg: move io_service_bytes and io_serviced stats into blkcg_gq · 77ea7338
      Tejun Heo authored
      Currently, both cfq-iosched and blk-throttle keep track of
      io_service_bytes and io_serviced stats.  While keeping track of them
      separately may be useful during development, it doesn't make much
      sense otherwise.  Also, blk-throttle was counting bio's as IOs while
      cfq-iosched request's, which is more confusing than informative.
      
      This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
      removes the counterparts from cfq-iosched and blk-throttle and let
      them print from the common blkg counters.  The common counters are
      incremented during bio issue in blkcg_bio_issue_check().
      
      The outputs are still filtered by whether the policy has
      blkg_policy_data on a given blkg, so cfq's output won't show up if it
      has never been used for a given blkg.  The only times when the outputs
      would differ significantly are when policies are attached on the fly
      or elevators are switched back and forth.  Those are quite exceptional
      operations and I don't think they warrant keeping separate counters.
      
      v3: Update blkio-controller.txt accordingly.
      
      v2: Account IOs during bio issues instead of request completions so
          that bio-based drivers can be handled the same way.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      77ea7338
    • Tejun Heo's avatar
      blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq · f12c74ca
      Tejun Heo authored
      Currently, blkg_[rw]stat_recursive_sum() assume that the target
      counter is located in pd (blkg_policy_data); however, some counters
      are planned to be moved to blkg (blkcg_gq).
      
      This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
      blkg_policy pointers instead of pd.  If policy is NULL, it indexes
      into blkg.  If non-NULL, into the blkg's pd of the policy.
      
      The existing usages are updated to maintain the current behaviors.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      f12c74ca
    • Tejun Heo's avatar
      blkcg: make blkcg_[rw]stat per-cpu · 24bdb8ef
      Tejun Heo authored
      blkcg_[rw]stat are used as stat counters for blkcg policies.  It isn't
      per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
      it.  This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
      per-cpu wrapping in blk-throttle.
      
      * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
        percpu_counter.  This makes syncp unnecessary as remote accesses are
        handled by percpu_counter itself.
      
      * blkg_[rw]stat_init() can now fail due to percpu allocation failure
        and thus are updated to return int.
      
      * percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.
      
      * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
        and summing results are stored in ->aux_cnt[] instead.
      
      * Custom per-cpu stat implementation in blk-throttle is removed.
      
      This makes all blkcg stat counters per-cpu without complicating policy
      implmentations.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      24bdb8ef
    • Tejun Heo's avatar
      blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it · e6269c44
      Tejun Heo authored
      cgroup stats are local to each cgroup and doesn't propagate to
      ancestors by default.  When recursive stats are necessary, the sum is
      calculated over all the descendants.  This initially was for backward
      compatibility to support both group-local and recursive stats but this
      mode of operation makes general sense as stat update is much hotter
      thafn reporting those stats.
      
      This however ends up losing recursive stats when a child is removed.
      To work around this, cfq-iosched adds its stats to its parent
      cfq_group->dead_stats which is summed up together when calculating
      recursive stats.
      
      It's planned that the core stats will be moved to blkcg_gq, so we want
      to move the mechanism for keeping track of the stats of dead children
      from cfq to blkcg core.  This patch adds blkg_[rw]stat->aux_cnt which
      are atomic64_t's keeping track of auxiliary counts which are excluded
      when reading local counts but included for recursive.
      
      blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
      are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
      a dead cgroup to the aux counts of parent->stats instead of separate
      ->dead_stats.
      
      This will also help making blkg_[rw]stats per-cpu.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      e6269c44
    • Tejun Heo's avatar
      blkcg: consolidate blkg creation in blkcg_bio_issue_check() · ae118896
      Tejun Heo authored
      blkg (blkcg_gq) currently is created by blkcg policies invoking
      blkg_lookup_create() which ends up repeating about the same code in
      different policies.  Theoretically, this can avoid the overhead of
      looking and/or creating blkg's if blkcg is enabled but no policy is in
      use; however, the cost of blkg lookup / creation is very low
      especially if only the root blkcg is in use which is highly likely if
      no blkcg policy is in active use - it boils down to a single very
      predictable conditional and surrounding RCU protection.
      
      This patch consolidates blkg creation to a new function
      blkcg_bio_issue_check() which is called during bio issue from
      generic_make_request_checks().  blkcg_bio_issue_check() is now the
      only function which tries to create missing blkg's.  The subsequent
      policy and request_list operations just perform blkg_lookup() and if
      missing falls back to the root.
      
      * blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
        instead of blkg_lookup_create().
      
      * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
        read locked and blkg already looked up.  Both throtl_lookup_tg() and
        throtl_lookup_create_tg() are dropped.
      
      * cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
        cfq_lookup_cfqg()which uses blkg_lookup().
      
      This consolidates blkg handling and avoids unnecessary blkg creation
      retries under memory pressure.  In addition, this provides a common
      bio entry point into blkcg where things like common accounting can be
      performed.
      
      v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
          !CONFIG_BLK_DEV_THROTTLING.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      ae118896
    • Tejun Heo's avatar
      blkcg: inline [__]blkg_lookup() · 24f29046
      Tejun Heo authored
      blkg_lookup() checks whether the target queue is bypassing and, if
      not, calls __blkg_lookup() which first checks the lookup hint and then
      performs radix tree walk.  The operations upto hint checking are
      trivial and there are many users of this function.  This patch inlines
      blkg_lookup() and the fast path part of __blkg_lookup().  The radix
      tree lookup and hint update are now in blkg_lookup_slowpath().
      
      This will help consolidating blkg handling by easing moving root blkcg
      short-circuit to inlined lookup fast path.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      24f29046
    • Tejun Heo's avatar
      blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods · e4a9bde9
      Tejun Heo authored
      Each active policy has a cpd (blkcg_policy_data) on each blkcg.  The
      cpd's were allocated by blkcg core and each policy could request to
      allocate extra space at the end by setting blkcg_policy->cpd_size
      larger than the size of cpd.
      
      This is a bit unusual but blkg (blkcg_gq) policy data used to be
      handled this way too so it made sense to be consistent; however, blkg
      policy data switched to alloc/free callbacks.
      
      This patch makes similar changes to cpd handling.
      blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size.  As
      cpd allocation is now done from policy side, it can simply allocate a
      larger area which embeds cpd at the beginning.
      
      As ->cpd_alloc_fn() may be able to perform all necessary
      initializations, this patch makes ->cpd_init_fn() optional.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: 's avatarJens Axboe <axboe@fb.com>
      e4a9bde9