1. 12 Apr, 2015 2 commits
  2. 31 Mar, 2015 1 commit
    • Mike Snitzer's avatar
      block: fix blk_stack_limits() regression due to lcm() change · e9637415
      Mike Snitzer authored
      Linux 3.19 commit 69c953c8 ("lib/lcm.c: lcm(n,0)=lcm(0,n) is 0, not n")
      caused blk_stack_limits() to not properly stack queue_limits for stacked
      devices (e.g. DM).
      Fix this regression by establishing lcm_not_zero() and switching
      blk_stack_limits() over to using it.
      DM uses blk_set_stacking_limits() to establish the initial top-level
      queue_limits that are then built up based on underlying devices' limits
      using blk_stack_limits().  In the case of optimal_io_size (io_opt)
      blk_set_stacking_limits() establishes a default value of 0.  With commit
      69c953c8, lcm(0, n) is no longer n, which compromises proper stacking of
      the underlying devices' io_opt.
      $ modprobe scsi_debug dev_size_mb=10 num_tgts=1 opt_blks=1536
      $ cat /sys/block/sde/queue/optimal_io_size
      $ dmsetup create node --table "0 100 linear /dev/sde 0"
      Before this fix:
      $ cat /sys/block/dm-5/queue/optimal_io_size
      After this fix:
      $ cat /sys/block/dm-5/queue/optimal_io_size
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.19+
      Acked-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  3. 20 Mar, 2015 1 commit
  4. 18 Mar, 2015 1 commit
  5. 13 Mar, 2015 1 commit
  6. 21 Feb, 2015 1 commit
    • Thadeu Lima de Souza Cascardo's avatar
      blk-throttle: check stats_cpu before reading it from sysfs · 045c47ca
      Thadeu Lima de Souza Cascardo authored
      When reading blkio.throttle.io_serviced in a recently created blkio
      cgroup, it's possible to race against the creation of a throttle policy,
      which delays the allocation of stats_cpu.
      Like other functions in the throttle code, just checking for a NULL
      stats_cpu prevents the following oops caused by that race.
      [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
      [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
      [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
      [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
      [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
      [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
      [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
      [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
      [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
      [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
      GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
      GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
      GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
      GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
      GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
      GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
      [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
      [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
      [ 1137.734943] Call Trace:
      [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
      [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
      [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
      [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
      [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
      [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
      [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
      [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
      [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
      [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
      [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
      [ 1137.735383] Instruction dump:
      [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
      [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
      And here is one code that allows to easily reproduce this, although this
      has first been found by running docker.
      void run(pid_t pid)
      	int n;
      	int status;
      	int fd;
      	char *buffer;
      	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
      	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
      	fd = open(CGPATH "/test/tasks", O_WRONLY);
      	write(fd, buffer, n);
      	if (fork() > 0) {
      		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
      		read(fd, buffer, 512);
      	} else {
      		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
      		n = read(fd, buffer, BUFFER_SIZE);
      void test(void)
      	int status;
      	mkdir(CGPATH "/test", 0666);
      	if (fork() > 0)
      	rmdir(CGPATH "/test");
      int main(int argc, char **argv)
      	int i;
      	for (i = 0; i < NR_TESTS; i++)
      	return 0;
      Reported-by: default avatarRicardo Marin Matinata <rmm@br.ibm.com>
      Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  7. 11 Feb, 2015 4 commits
  8. 10 Feb, 2015 1 commit
  9. 09 Feb, 2015 1 commit
  10. 05 Feb, 2015 8 commits
  11. 04 Feb, 2015 1 commit
    • Peter Zijlstra's avatar
      block: Simplify bsg complete all · 2c561246
      Peter Zijlstra authored
      It took me a few tries to figure out what this code did; lets rewrite
      it into a more regular form.
      The thing that makes this one 'special' is the BSG_F_BLOCK flag, if
      that is not set we're not supposed/allowed to block and should spin
      wait for completion.
      The (new) io_wait_event() will never see a false condition in case of
      the spinning and we will therefore not block.
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  12. 29 Jan, 2015 2 commits
  13. 28 Jan, 2015 3 commits
  14. 23 Jan, 2015 2 commits
    • Shaohua Li's avatar
      blk-mq: add tag allocation policy · 24391c0d
      Shaohua Li authored
      This is the blk-mq part to support tag allocation policy. The default
      allocation policy isn't changed (though it's not a strict FIFO). The new
      policy is round-robin for libata. But it's a try-best implementation. If
      multiple tasks are competing, the tags returned will be mixed (which is
      unavoidable even with !mq, as requests from different tasks can be
      mixed in queue)
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Shaohua Li's avatar
      block: support different tag allocation policy · ee1b6f7a
      Shaohua Li authored
      The libata tag allocation is using a round-robin policy. Next patch will
      make libata use block generic tag allocation, so let's add a policy to
      tag allocation.
      Currently two policies: FIFO (default) and round-robin.
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  15. 22 Jan, 2015 1 commit
  16. 21 Jan, 2015 2 commits
    • Martin K. Petersen's avatar
      block: Add discard flag to blkdev_issue_zeroout() function · d93ba7a5
      Martin K. Petersen authored
      blkdev_issue_discard() will zero a given block range. This is done by
      way of explicit writing, thus provisioning or allocating the blocks on
      There are use cases where the desired behavior is to zero the blocks but
      unprovision them if possible. The blocks must deterministically contain
      zeroes when they are subsequently read back.
      This patch adds a flag to blkdev_issue_zeroout() that provides this
      variant. If the discard flag is set and a block device guarantees
      discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
      the device does not support discard_zeroes_data or if the discard
      request fails we will fall back to first REQ_WRITE_SAME and then a
      regular REQ_WRITE.
      Also update the callers of blkdev_issue_zero() to reflect the new flag
      and make sb_issue_zeroout() prefer the discard approach.
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Jeff Moyer's avatar
      cfq-iosched: fix incorrect filing of rt async cfqq · c6ce1943
      Jeff Moyer authored
      If you can manage to submit an async write as the first async I/O from
      the context of a process with realtime scheduling priority, then a
      cfq_queue is allocated, but filed into the wrong async_cfqq bucket.  It
      ends up in the best effort array, but actually has realtime I/O
      scheduling priority set in cfqq->ioprio.
      The reason is that cfq_get_queue assumes the default scheduling class and
      priority when there is no information present (i.e. when the async cfqq
      is created):
      static struct cfq_queue *
      cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
      	      struct bio *bio, gfp_t gfp_mask)
      	const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);
      cic->ioprio starts out as 0, which is "invalid".  So, class of 0
      (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:
      		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
      static struct cfq_queue **
      cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
              switch (ioprio_class) {
              case IOPRIO_CLASS_RT:
                      return &cfqd->async_cfqq[0][ioprio];
              case IOPRIO_CLASS_NONE:
                      ioprio = IOPRIO_NORM;
                      /* fall through */
              case IOPRIO_CLASS_BE:
                      return &cfqd->async_cfqq[1][ioprio];
              case IOPRIO_CLASS_IDLE:
                      return &cfqd->async_idle_cfqq;
      Here, instead of returning a class mapped from the process' scheduling
      priority, we get back the bucket associated with IOPRIO_CLASS_BE.
      Now, there is no queue allocated there yet, so we create it:
      		cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);
      That function ends up doing this:
      			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
      			cfq_init_prio_data(cfqq, cic);
      cfq_init_cfqq marks the priority as having changed.  Then, cfq_init_prio
      data does this:
      	ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	switch (ioprio_class) {
      		printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
      	case IOPRIO_CLASS_NONE:
      		 * no prio set, inherit CPU scheduling settings
      		cfqq->ioprio = task_nice_ioprio(tsk);
      		cfqq->ioprio_class = task_nice_ioclass(tsk);
      So we basically have two code paths that treat IOPRIO_CLASS_NONE
      differently, which results in an RT async cfqq filed into a best effort
      Attached is a patch which fixes the problem.  I'm not sure how to make
      it cleaner.  Suggestions would be welcome.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Tested-by: default avatarHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  17. 20 Jan, 2015 2 commits
  18. 14 Jan, 2015 1 commit
    • Jens Axboe's avatar
      blk-mq: fix false negative out-of-tags condition · 0bf36498
      Jens Axboe authored
      The blk-mq tagging tries to maintain some locality between CPUs and
      the tags issued. The tags are split into groups of words, and the
      words may not be fully populated. When searching for a new free tag,
      blk-mq may look at partial words, hence it passes in an offset/size
      to find_next_zero_bit(). However, it does that wrong, the size must
      always be the full length of the number of tags in that word,
      otherwise we'll potentially miss some near the end.
      Another issue is when __bt_get() goes from one word set to the next.
      It bumps the index, but not the last_tag associated with the
      previous index. Bump that to be in the range of the new word.
      Finally, clean up __bt_get() and __bt_get_word() a bit and get
      rid of the goto in there, and the unnecessary 'wrap' variable.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  19. 08 Jan, 2015 5 commits