- Jun 08, 2017
-
-
Paolo Valente authored
In blk-cgroup, operations on blkg objects are protected with the request_queue lock. This is no longer the lock that protects I/O-scheduler operations in blk-mq. In fact, the latter are now protected with a finer-grained per-scheduler-instance lock. As a consequence, although blkg lookups are also rcu-protected, blk-mq I/O schedulers may see inconsistent data when they access blkg and blkg-related objects. BFQ does access these objects, and does incur this problem, in the following case. The blkg_lookup performed in bfq_get_queue, being protected (only) through rcu, may happen to return the address of a copy of the original blkg. If this is the case, then the blkg_get performed in bfq_get_queue, to pin down the blkg, is useless: it does not prevent blk-cgroup code from destroying both the original blkg and all objects directly or indirectly referred to by the copy of the blkg. BFQ accesses these objects, which typically causes a crash due to a NULL-pointer dereference or a memory-protection violation. Some additional protection mechanism should be added to blk-cgroup to address this issue. In the meantime, this commit provides a quick temporary fix for BFQ: cache (when safe) blkg data that might disappear right after a blkg_lookup. In particular, this commit exploits the following facts to achieve its goal without introducing further locks. Destroy operations on a blkg invoke, as a first step, hooks of the scheduler associated with the blkg. And these hooks are executed with bfqd->lock held for BFQ. As a consequence, for any blkg associated with the request queue an instance of BFQ is attached to, we are guaranteed that such a blkg is not destroyed, and that all the pointers it contains are consistent, while that instance is holding its bfqd->lock. A blkg_lookup performed with bfqd->lock held then returns a fully consistent blkg, which remains consistent as long as this lock is held. In more detail, this holds even if the returned blkg is a copy of the original one. 
Finally, the object describing a group inside BFQ also needs to be protected from destruction on the blkg_free of the original blkg (which invokes bfq_pd_free). This commit adds private refcounting for this object, to let it disappear only after no bfq_queue refers to it any longer. This commit also removes or updates some stale comments on locking issues related to blk-cgroup operations. Reported-by:
Tomas Konir <tomas.konir@gmail.com> Reported-by:
Lee Tibbert <lee.tibbert@gmail.com> Reported-by:
Marco Piazza <mpiazza@gmail.com> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Tested-by:
Tomas Konir <tomas.konir@gmail.com> Tested-by:
Lee Tibbert <lee.tibbert@gmail.com> Tested-by:
Marco Piazza <mpiazza@gmail.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
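The private refcounting described in the commit above can be illustrated with a minimal userspace sketch. It is not the BFQ code: struct group_sketch and the group_* helpers are hypothetical names, the counter is a plain int on the assumption that every get/put runs under one lock (as the message says of bfqd->lock), and a flag is set where real code would free the object.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the scheduler's per-group object. */
struct group_sketch {
	int ref;	/* one reference for the blkg side, one per queue */
	bool freed;	/* set instead of calling free(), so it can be observed */
};

/* The creator (the blkg policy data in this sketch) owns the first reference. */
static void group_init(struct group_sketch *g)
{
	g->ref = 1;
	g->freed = false;
}

/* Taken by each queue that starts referring to the group. */
static void group_get(struct group_sketch *g)
{
	assert(!g->freed && g->ref > 0);
	g->ref++;
}

/* Dropped by a queue, or by the blkg side when the blkg is freed. */
static void group_put(struct group_sketch *g)
{
	assert(g->ref > 0);
	if (--g->ref == 0)
		g->freed = true;	/* last holder gone: safe to release */
}
```

With this shape, freeing the blkg only drops one reference, so the group object outlives the blkg for as long as some queue still points at it.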
-
- Jun 07, 2017
-
-
Shaohua Li authored
Hard disk IO latency varies a lot depending on spindle movement. The latency can range from several microseconds to several milliseconds, so it is pretty hard to get the baseline latency used by io.low. We will use a different strategy here. The idea is to use only IO with spindle movement to determine if cgroup IO is in a good state. For HD, if IO latency is small (< 1ms), we ignore the IO. Such IO is likely sequential IO, and does not help determine if a cgroup's IO is impacted by other cgroups. With this, we only account IO with big latency. Then we can choose a hardcoded baseline latency for HD (4ms, which is a typical IO latency with seek). With all these settings, the io.low latency works for both HD and SSD. Signed-off-by:
Shaohua Li <shli@fb.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
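The filtering rule in the commit above can be sketched in a few lines of userspace C. Only the 1ms cutoff and the 4ms baseline come from the message; the function name, the microsecond unit and the averaging are assumptions made for illustration, not the blk-throttle implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HD_IGNORE_BELOW_USEC	1000u	/* < 1ms: likely sequential, ignored */
#define HD_BASELINE_USEC	4000u	/* hardcoded HD baseline from the message */

/*
 * Average latency over the samples that are accounted for a hard disk.
 * Samples below the cutoff are skipped. Returns 0 if nothing qualifies;
 * the number of accounted samples is stored in *accounted.
 */
static uint64_t hd_accounted_avg_usec(const uint64_t *lat_usec, size_t n,
				      size_t *accounted)
{
	uint64_t sum = 0;
	size_t i, cnt = 0;

	assert(lat_usec != NULL || n == 0);
	assert(accounted != NULL);

	for (i = 0; i < n; i++) {
		if (lat_usec[i] < HD_IGNORE_BELOW_USEC)
			continue;	/* no seek involved, says nothing about contention */
		sum += lat_usec[i];
		cnt++;
	}
	*accounted = cnt;
	return cnt ? sum / cnt : 0;
}
```

A caller could then compare the returned average against HD_BASELINE_USEC plus whatever latency target the cgroup was configured with.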
-
Joseph Qi authored
I have encountered a NULL pointer dereference in throtl_schedule_pending_timer: [ 413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 [ 413.735535] IP: [<ffffffff812ebbbf>] throtl_schedule_pending_timer+0x3f/0x210 [ 413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0 [ 413.735713] Oops: 0000 [#1] SMP ...... This is caused by the following case: blk_throtl_bio throtl_schedule_next_dispatch <= sq is top level one without parent throtl_schedule_pending_timer sq_to_tg(sq)->td->throtl_slice <= sq_to_tg(sq) returns NULL Fix it by using sq_to_td instead of sq_to_tg(sq)->td, which will always return a valid td. Fixes: 297e3d85 ("blk-throttle: make throtl_slice tunable") Signed-off-by:
Joseph Qi <qijiang.qj@alibaba-inc.com> Reviewed-by:
Shaohua Li <shli@fb.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
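The difference between the two lookups in the commit above can be shown with a simplified userspace model. The *_sketch structures are hypothetical and reduced to the fields needed here; they only assume, as the message states, that the top-level service queue has no parent and that sq_to_tg() returns NULL for it.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace equivalent of container_of(). */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sq_sketch {
	struct sq_sketch *parent_sq;	/* NULL for the top-level queue */
};

struct td_sketch {
	struct sq_sketch service_queue;	/* the top-level service queue */
	unsigned int throtl_slice;
};

struct tg_sketch {
	struct sq_sketch service_queue;	/* per-group service queue */
	struct td_sketch *td;
};

/* Returns NULL for the top-level queue, which belongs to no group. */
static struct tg_sketch *sq_to_tg_sketch(struct sq_sketch *sq)
{
	if (sq && sq->parent_sq)
		return container_of_sketch(sq, struct tg_sketch, service_queue);
	return NULL;
}

/* Valid for any service queue: falls back to the embedding td. */
static struct td_sketch *sq_to_td_sketch(struct sq_sketch *sq)
{
	struct tg_sketch *tg;

	assert(sq != NULL);
	tg = sq_to_tg_sketch(sq);
	if (tg)
		return tg->td;
	return container_of_sketch(sq, struct td_sketch, service_queue);
}
```

Dereferencing sq_to_tg_sketch(sq)->td on the top-level queue would fault in this model too, while sq_to_td_sketch(sq)->throtl_slice is safe for both kinds of queue.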
-
- Jun 06, 2017
-
-
Ming Lei authored
If a queue is stopped, we shouldn't dispatch requests into the driver and hardware. Unfortunately, the check was removed in bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers). This patch fixes the issue by moving the check back into __blk_mq_try_issue_directly(). This fixes a request use-after-free [1][2] while canceling NVMe requests in nvme_dev_disable(), which can be triggered easily during NVMe reset & remove tests. [1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on [ 103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a [ 103.412980] IP: bio_integrity_advance+0x48/0xf0 [ 103.412981] PGD 275a88067 [ 103.412981] P4D 275a88067 [ 103.412982] PUD 276c43067 [ 103.412983] PMD 0 [ 103.412984] [ 103.412986] Oops: 0000 [#1] SMP [ 103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod [ 103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1 [ 103.413036] Hardware name: Dell Inc. 
PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme] [ 103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000 [ 103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0 [ 103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202 [ 103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240 [ 103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00 [ 103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5 [ 103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18 [ 103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb [ 103.413053] FS: 0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000 [ 103.413054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0 [ 103.413056] Call Trace: [ 103.413063] bio_advance+0x2a/0xe0 [ 103.413067] blk_update_request+0x76/0x330 [ 103.413072] blk_mq_end_request+0x1a/0x70 [ 103.413074] blk_mq_dispatch_rq_list+0x370/0x410 [ 103.413076] ? blk_mq_flush_busy_ctxs+0x94/0xe0 [ 103.413080] blk_mq_sched_dispatch_requests+0x173/0x1a0 [ 103.413083] __blk_mq_run_hw_queue+0x8e/0xa0 [ 103.413085] __blk_mq_delay_run_hw_queue+0x9d/0xa0 [ 103.413088] blk_mq_start_hw_queue+0x17/0x20 [ 103.413090] blk_mq_start_hw_queues+0x32/0x50 [ 103.413095] nvme_kill_queues+0x54/0x80 [nvme_core] [ 103.413097] nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme] [ 103.413103] process_one_work+0x149/0x360 [ 103.413105] worker_thread+0x4d/0x3c0 [ 103.413109] kthread+0x109/0x140 [ 103.413111] ? rescuer_thread+0x380/0x380 [ 103.413113] ? 
kthread_park+0x60/0x60 [ 103.413120] ret_from_fork+0x2c/0x40 [ 103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8 [ 103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10 [ 103.413146] CR2: 000000000000000a [ 103.413157] ---[ end trace cd6875d16eb5a11e ]--- [ 103.455368] Kernel panic - not syncing: Fatal exception [ 103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 103.850916] ---[ end Kernel panic - not syncing: Fatal exception [ 103.857637] sched: Unexpected reschedule of offline CPU#1! [ 103.863762] ------------[ cut here ]------------ [2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off [ 247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds. [ 247.137311] Not tainted 4.12.0-rc2.upstream+ #4 [ 247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 247.151704] Call Trace: [ 247.154445] __schedule+0x28a/0x880 [ 247.158341] schedule+0x36/0x80 [ 247.161850] blk_mq_freeze_queue_wait+0x4b/0xb0 [ 247.166913] ? 
remove_wait_queue+0x60/0x60 [ 247.171485] blk_freeze_queue+0x1a/0x20 [ 247.175770] blk_cleanup_queue+0x7f/0x140 [ 247.180252] nvme_ns_remove+0xa3/0xb0 [nvme_core] [ 247.185503] nvme_remove_namespaces+0x32/0x50 [nvme_core] [ 247.191532] nvme_uninit_ctrl+0x2d/0xa0 [nvme_core] [ 247.196977] nvme_remove+0x70/0x110 [nvme] [ 247.201545] pci_device_remove+0x39/0xc0 [ 247.205927] device_release_driver_internal+0x141/0x200 [ 247.211761] device_release_driver+0x12/0x20 [ 247.216531] pci_stop_bus_device+0x8c/0xa0 [ 247.221104] pci_stop_and_remove_bus_device_locked+0x1a/0x30 [ 247.227420] remove_store+0x7c/0x90 [ 247.231320] dev_attr_store+0x18/0x30 [ 247.235409] sysfs_kf_write+0x3a/0x50 [ 247.239497] kernfs_fop_write+0xff/0x180 [ 247.243867] __vfs_write+0x37/0x160 [ 247.247757] ? selinux_file_permission+0xe5/0x120 [ 247.253011] ? security_file_permission+0x3b/0xc0 [ 247.258260] vfs_write+0xb2/0x1b0 [ 247.261964] ? syscall_trace_enter+0x1d0/0x2b0 [ 247.266924] SyS_write+0x55/0xc0 [ 247.270540] do_syscall_64+0x67/0x150 [ 247.274636] entry_SYSCALL64_slow_path+0x25/0x25 [ 247.279794] RIP: 0033:0x7f5c96740840 [ 247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840 [ 247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001 [ 247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740 [ 247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400 [ 247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000 [ 370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds. Fixes: 12d70958(blk-mq: don't fail allocating driver tag for stopped hw queue) Cc: stable@vger.kernel.org Signed-off-by:
Ming Lei <ming.lei@redhat.com> Reviewed-by:
Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
When direct issue is done on a request picked up from the plug list, the hctx needs to be updated with the actual hw queue. Otherwise the wrong hctx is used, which may hurt performance, especially when the wrong SRCU read lock is acquired/released. Reported-by:
Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Jun 03, 2017
-
-
Dmitry Monakhov authored
If a bio has no data, such as the ones from blkdev_issue_flush(), then we have nothing to protect. This patch prevents a BUG like the following: kfree_debugcheck: out of range ptr ac1fa1d106742a5ah kernel BUG at mm/slab.c:2773! invalid opcode: 0000 [#1] SMP Modules linked in: bcache CPU: 0 PID: 4428 Comm: xfs_io Tainted: G W 4.11.0-rc4-ext4-00041-g2ef0043-dirty #43 Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014 task: ffff880137786440 task.stack: ffffc90000ba8000 RIP: 0010:kfree_debugcheck+0x25/0x2a RSP: 0018:ffffc90000babde0 EFLAGS: 00010082 RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40 RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282 R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001 FS: 00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0 Call Trace: kfree+0xc8/0x1b3 bio_integrity_free+0xc3/0x16b bio_free+0x25/0x66 bio_put+0x14/0x26 blkdev_issue_flush+0x7a/0x85 blkdev_fsync+0x35/0x42 vfs_fsync_range+0x8e/0x9f vfs_fsync+0x1c/0x1e do_fsync+0x31/0x4a SyS_fsync+0x10/0x14 entry_SYSCALL_64_fastpath+0x1f/0xc2 Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Jun 01, 2017
-
-
Bart Van Assche authored
Since the introduction of .init_rq_fn() and .exit_rq_fn() it is essential that the memory allocated for struct request_queue stays around until all blk_exit_rl() calls have finished. Hence make blk_init_rl() take a reference on struct request_queue. This patch fixes the following crash: general protection fault: 0000 [#2] SMP CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 task: ffff88013a108040 task.stack: ffffc9000071c000 RIP: 0010:free_request_size+0x1a/0x30 RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003 RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418 RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009 R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8 R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0 Call Trace: mempool_destroy.part.10+0x21/0x40 mempool_destroy+0xe/0x10 blk_exit_rl+0x12/0x20 blkg_free+0x4d/0xa0 __blkg_release_rcu+0x59/0x170 rcu_process_callbacks+0x260/0x4e0 __do_softirq+0x116/0x250 smpboot_thread_fn+0x123/0x1e0 kthread+0x109/0x140 ret_from_fork+0x31/0x40 Fixes: commit e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request") Signed-off-by:
Bart Van Assche <bart.vanassche@sandisk.com> Acked-by:
Tejun Heo <tj@kernel.org> Reviewed-by:
Hannes Reinecke <hare@suse.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> # v4.11+ Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- May 31, 2017
-
-
Hou Tao authored
When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY as the delay of cfq_group's vdisktime if there have been other cfq_groups already. When cfq is under iops mode, commit 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds") could result in a large iops delay and lead to an abnormal io schedule delay for the added cfq_group. To fix it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5 when iops mode is enabled. Despite having the same value, the delay of a cfq_queue in idle class and the delay of cfq_group are different things, so I define two new macros for the delay of a cfq_group under time-slice mode and iops mode. Fixes: 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds") Cc: <stable@vger.kernel.org> # 4.8+ Signed-off-by:
Hou Tao <houtao1@huawei.com> Acked-by:
Jan Kara <jack@suse.cz> Signed-off-by:
Jens Axboe <axboe@fb.com>
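The arithmetic behind the commit above can be sketched as follows. The function name and NSEC_PER_SEC_SKETCH are hypothetical, and hz is passed in as a parameter because the HZ value depends on the kernel configuration. Only the two delay values (HZ / 5 for iops mode, the nanosecond equivalent of 200ms for time-slice mode) follow the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/*
 * Delay added to a new group's vdisktime when other groups already exist.
 * In time-slice mode vdisktime is measured in nanoseconds, so 200ms is
 * NSEC_PER_SEC / 5. In iops mode vdisktime is not a time, so the message
 * goes back to the old, much smaller HZ / 5 value.
 */
static uint64_t group_delay_sketch(bool iops_mode, unsigned int hz)
{
	assert(hz > 0);
	if (iops_mode)
		return hz / 5;
	return NSEC_PER_SEC_SKETCH / 5;
}
```

For an assumed HZ of 250, the iops-mode delay is 50 while the time-slice delay is 200000000, which shows how large the delay becomes if the nanosecond value is applied in iops mode.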
-
- May 30, 2017
-
-
Keith Busch authored
The tagset lock needs to be held when iterating the tag_list, so a lockdep assert was added when updating number of hardware queues. The drivers calling this API, however, were unaware of the new requirement, so are failing the assertion. This patch takes the lock within the blk-mq function so the drivers do not have to be modified in order to be safe. Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") Reported-by:
Gabriel Krisman Bertazi <krisman@collabora.co.uk> Reviewed-by:
Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by:
Keith Busch <keith.busch@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- May 26, 2017
-
-
Bart Van Assche authored
The code in blk-mq-debugfs.c assumes that it is working on a blk-mq queue and is not intended to work on a blk-sq queue. Hence only register blk-mq debugfs attributes for blk-mq queues. Fixes: commit 9c1051aa ("blk-mq: untangle debugfs and sysfs") Signed-off-by:
Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- May 23, 2017
-
-
Richard authored
The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD and NetBSD partitions and does a reasonable job picking out OpenBSD and NetBSD UFS subpartitions. But for FreeBSD the subpartitions are always "bad". Kernel: <bsd:bad subpartition - ignored Though all 3 of these BSD systems use UFS as a file system, only FreeBSD uses relative start addresses in the subpartition declarations. The following patch fixes this for FreeBSD partitions and leaves the code for OpenBSD and NetBSD intact: Signed-off-by:
Richard Narron <comet.berkeley@gmail.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@fb.com>
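The effect of relative versus absolute start addresses described above can be sketched in userspace C. The helper names, parameters and the simplified bounds check are assumptions for illustration; this is not the code in block/partitions/msdos.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Absolute start sector of a BSD subpartition. When the label stores
 * offsets relative to the enclosing slice (the FreeBSD case in the
 * message), the slice start has to be added first.
 */
static uint64_t subpart_abs_start(uint64_t slice_start, uint64_t p_offset,
				  bool relative)
{
	assert(slice_start <= UINT64_MAX - p_offset);
	return relative ? slice_start + p_offset : p_offset;
}

/* Simplified check: a subpartition outside its slice is treated as "bad". */
static bool subpart_in_slice(uint64_t slice_start, uint64_t slice_size,
			     uint64_t start, uint64_t size)
{
	return start >= slice_start &&
	       start + size <= slice_start + slice_size;
}
```

Interpreting a relative offset as absolute places the subpartition before the start of its slice, which is the kind of result that gets reported as a bad subpartition.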
-
Dan Carpenter authored
We don't set an error code on this path. It means that we return NULL instead of an error pointer and the caller does a NULL dereference. Fixes: 6d1d8050 ("block, partition: add partition_meta_info to hd_struct") Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
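The error-pointer convention behind the commit above can be sketched in userspace C. The err_ptr/ptr_err/is_err helpers below are hypothetical reimplementations modeled on the kernel's ERR_PTR idiom, and add_part_sketch is an invented function with a simulated allocation failure; neither is the patched code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value in a pointer. */
static void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO_SKETCH);
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer lies in the range reserved for error codes. */
static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

struct part_sketch {
	int partno;
};

static struct part_sketch the_part;

/* The failing path must set err before jumping to the common exit. */
static struct part_sketch *add_part_sketch(int partno, int alloc_fails)
{
	int err;

	if (alloc_fails) {
		err = -ENOMEM;	/* without this, the caller would see a bogus pointer */
		goto out;
	}
	the_part.partno = partno;
	return &the_part;
out:
	return err_ptr(err);
}
```

A caller that only checks is_err() never tests for NULL, which is why a path that returns NULL instead of an error pointer ends in a NULL dereference.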
-
- May 22, 2017
-
-
Shaohua Li authored
The default value of the io.low limit is 0. If the user doesn't configure the limit, the last patch makes the cgroup be throttled to very tiny bps/iops, which could stall the system. A cgroup with default settings of the io.low limit really means nothing, so we force the user to configure all settings, otherwise the io.low limit doesn't take effect. With this strategy, the default setting of latency/idle isn't important, so just set them to very conservative and safe values. Signed-off-by:
Shaohua Li <shli@fb.com> Acked-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
If a cgroup has a low limit of 0 for both bps and iops, the cgroup's low limit is ignored and we throttle the cgroup with its max limit. In this way, other cgroups with a low limit will not get protected. To fix this, we no longer make this exception. A cgroup will be throttled to a limit of 0 if it uses the default setting. To avoid a complete stall, we give such a cgroup tiny IO resources. Signed-off-by:
Shaohua Li <shli@fb.com> Acked-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
This info is important for understanding what's happening and helps with debugging. Signed-off-by:
Shaohua Li <shli@fb.com> Acked-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Shaohua Li authored
For idle time, a child's setting should not be bigger than its parent's. For latency target, a child's setting should not be smaller than its parent's. The leaf nodes will adjust their settings according to the hierarchy, compare their IO with the settings, and do upgrade/downgrade. Parent nodes don't need to track their IO latency/idle time. Signed-off-by:
Shaohua Li <shli@fb.com> Acked-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Jens Axboe <axboe@fb.com>
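The hierarchy rule in the commit above can be sketched as a walk from a leaf up to the root. struct low_cfg_sketch and effective_cfg_sketch are hypothetical names and the units are left abstract; only the two constraints (idle time capped by every ancestor, latency target raised to every ancestor's value) come from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct low_cfg_sketch {
	uint64_t idletime_threshold;
	uint64_t latency_target;
	const struct low_cfg_sketch *parent;	/* NULL at the root */
};

/*
 * Effective settings for a leaf: idle time may not exceed any ancestor's,
 * and the latency target may not be smaller than any ancestor's.
 */
static struct low_cfg_sketch
effective_cfg_sketch(const struct low_cfg_sketch *leaf)
{
	struct low_cfg_sketch eff;
	const struct low_cfg_sketch *p;

	assert(leaf != NULL);
	eff = *leaf;
	eff.parent = NULL;	/* the result is a standalone copy */

	for (p = leaf->parent; p; p = p->parent) {
		if (p->idletime_threshold < eff.idletime_threshold)
			eff.idletime_threshold = p->idletime_threshold;
		if (p->latency_target > eff.latency_target)
			eff.latency_target = p->latency_target;
	}
	return eff;
}
```

Because only leaves evaluate the result, parents in this sketch carry configuration but no IO tracking state of their own.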
-
Ming Lei authored
No one uses it any more, so remove it. Reviewed-by:
Keith Busch <keith.busch@intel.com> Reviewed-by:
Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by:
Ming Lei <ming.lei@redhat.com> Signed-off-by:
Christoph Hellwig <hch@lst.de>
-
- May 11, 2017
-
-
Christoph Hellwig authored
SCSI devices can return short writes on Write Same just like for normal writes, so we need to handle this case for our special payload requests as well. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reported-by:
Abdul Haleem <abdhalee@linux.vnet.ibm.com> Tested-by:
Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- May 10, 2017
-
-
Wen Xiong authored
When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with a split op returns "Input/output error". It looks like the block layer splits the bio after calling bio_integrity_prep(bio). This patch fixes the issue. Below is how we debugged this issue: (1) format nvme to 4K block size with type 2 DIF (2) dd with block size bigger than 1024k. oflag=direct dd: error writing '/dev/nvme0n1': Input/output error We added some debug code in the nvme device driver. It showed us that the first op and the second op have the same bi and pi address. This is not correct. 1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x505 Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828 2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x605 ==> This op fails, as do the subsequent 5 retries. Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828 With the fix, it showed us that both the first op and the second op have the correct bi and pi address. 1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x505 Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828 2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0, AT=0x0 & RT=0x605 Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual 0x00003028 Signed-off-by:
Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
If PREEMPT_RCU is enabled, rcu_read_lock() isn't strong enough for us to use this_cpu_ptr() in that section. Use the safer get/put_cpu_ptr() variants instead. Reported-by:
Mike Galbraith <efault@gmx.de> Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting") Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
We warn twice for switching to a scheduler, if that switch fails. As we also report the failure in the return value to the sysfs write, remove the dmesg induced failures. Keep the failure print for warning to switch to the kconfig selected IO scheduler, as we can't report errors for that in any other way. Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Paolo Valente authored
The introduction of the BFQ and Kyber I/O schedulers has triggered a new wave of I/O benchmarks. Unfortunately, comments and discussions on these benchmarks confirm that there is still little awareness that it is very hard to achieve, at the same time, a low latency and a high throughput. In particular, virtually all benchmarks measure throughput, or throughput-related figures of merit, but, for BFQ, they use the scheduler in its default configuration. This configuration is geared, instead, toward a low latency. This is evidently a sign that BFQ documentation is still too unclear on this important aspect. This commit addresses this issue by stressing how BFQ configuration must be (easily) changed if the only goal is maximum throughput. Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Paolo Valente authored
In the function __bfq_deactivate_entity, the pointer entity->sched_data could happen to be used before being properly initialized. This led to a NULL pointer dereference. This commit fixes this bug by just using this pointer only where it is safe to do so. Reported-by:
Tom Harrison <l12436.tw@gmail.com> Tested-by:
Tom Harrison <l12436.tw@gmail.com> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- May 08, 2017
-
-
Dan Williams authored
For configurations that do not enable DAX filesystems or drivers, do not require the DAX core to be built. Given that the 'direct_access' method has been removed from 'block_device_operations', we can also go ahead and move the block-related dax helper functions from fs/block_dev.c to drivers/dax/super.c. This keeps dax details out of the block layer and lets the DAX core be built as a module in the FS_DAX=n case. Filesystems need to include dax.h to call bdev_dax_supported(). Cc: linux-xfs@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by:
Jan Kara <jack@suse.com> Reported-by:
Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by:
Dan Williams <dan.j.williams@intel.com>
-
Colin Ian King authored
Making __blk_mq_stop_hw_queues static fixes sparse warning: block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not declared. Should it be static? Fixes: 2719aa21 ("blk-mq: don't use sync workqueue flushing from drivers") Signed-off-by:
Colin Ian King <colin.king@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Wanpeng Li authored
This can be triggered by hot-unplug one cpu. ====================================================== [ INFO: possible circular locking dependency detected ] 4.11.0+ #17 Not tainted ------------------------------------------------------- step_after_susp/2640 is trying to acquire lock: (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110 but task is already holding lock: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug.lock){+.+.+.}: lock_acquire+0x11c/0x230 __mutex_lock+0x92/0x990 mutex_lock_nested+0x1b/0x20 get_online_cpus+0x64/0x80 blk_mq_init_allocated_queue+0x3a0/0x4e0 blk_mq_init_queue+0x3a/0x60 loop_add+0xe5/0x280 loop_init+0x124/0x177 do_one_initcall+0x53/0x1c0 kernel_init_freeable+0x1e3/0x27f kernel_init+0xe/0x100 ret_from_fork+0x31/0x40 -> #0 (all_q_mutex){+.+...}: __lock_acquire+0x189a/0x18a0 lock_acquire+0x11c/0x230 __mutex_lock+0x92/0x990 mutex_lock_nested+0x1b/0x20 blk_mq_queue_reinit_work+0x18/0x110 blk_mq_queue_reinit_dead+0x1c/0x20 cpuhp_invoke_callback+0x1f2/0x810 cpuhp_down_callbacks+0x42/0x80 _cpu_down+0xb2/0xe0 freeze_secondary_cpus+0xb6/0x390 suspend_devices_and_enter+0x3b3/0xa40 pm_suspend+0x129/0x490 state_store+0x82/0xf0 kobj_attr_store+0xf/0x20 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x135/0x1c0 __vfs_write+0x37/0x160 vfs_write+0xcd/0x1d0 SyS_write+0x58/0xc0 do_syscall_64+0x8f/0x710 return_from_SYSCALL_64+0x0/0x7a other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(all_q_mutex); lock(cpu_hotplug.lock); lock(all_q_mutex); *** DEADLOCK *** 8 locks held by step_after_susp/2640: #0: (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0 #1: (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0 #2: (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] 
kernfs_fop_write+0x109/0x1c0 #3: (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490 #4: (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20 #5: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390 #6: (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0 #7: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0 stack backtrace: CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17 Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016 Call Trace: dump_stack+0x99/0xce print_circular_bug+0x1fa/0x270 __lock_acquire+0x189a/0x18a0 lock_acquire+0x11c/0x230 ? lock_acquire+0x11c/0x230 ? blk_mq_queue_reinit_work+0x18/0x110 ? blk_mq_queue_reinit_work+0x18/0x110 __mutex_lock+0x92/0x990 ? blk_mq_queue_reinit_work+0x18/0x110 ? kmem_cache_free+0x2cb/0x330 ? anon_transport_class_unregister+0x20/0x20 ? blk_mq_queue_reinit_work+0x110/0x110 mutex_lock_nested+0x1b/0x20 ? mutex_lock_nested+0x1b/0x20 blk_mq_queue_reinit_work+0x18/0x110 blk_mq_queue_reinit_dead+0x1c/0x20 cpuhp_invoke_callback+0x1f2/0x810 ? __flow_cache_shrink+0x160/0x160 cpuhp_down_callbacks+0x42/0x80 _cpu_down+0xb2/0xe0 freeze_secondary_cpus+0xb6/0x390 suspend_devices_and_enter+0x3b3/0xa40 ? rcu_read_lock_sched_held+0x79/0x80 pm_suspend+0x129/0x490 state_store+0x82/0xf0 kobj_attr_store+0xf/0x20 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x135/0x1c0 __vfs_write+0x37/0x160 ? rcu_read_lock_sched_held+0x79/0x80 ? rcu_sync_lockdep_assert+0x2f/0x60 ? __sb_start_write+0xd9/0x1c0 ? vfs_write+0x1ad/0x1d0 vfs_write+0xcd/0x1d0 SyS_write+0x58/0xc0 ? rcu_read_lock_sched_held+0x79/0x80 do_syscall_64+0x8f/0x710 ? 
trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 The CPU hotplug path holds cpu_hotplug.lock and then reinitializes all existing blk-mq queues under all_q_mutex, whereas blk_mq_init_allocated_queue() takes these two locks in the inverse order. This is due to commit eabe0659 (blk/mq: Cure cpu hotplug lock inversion), which fixes a CPU hotplug lock inversion caused by the hotplug rework. However, the hotplug rework is still work in progress and lives in a -tip branch, so mainline cannot yet trigger that splat. The commit breaks Linus's tree in the merge window, so this patch reverts the lock order and avoids the splat in Linus's tree. Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- May 04, 2017
-
-
Omar Sandoval authored
Expose the fifo lists, cached next requests, batching state, and dispatch list. It'd also be possible to add the sorted lists, but there aren't already seq_file helpers for rbtrees. Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Expose the domain token pools, asynchronous sbitmap depth, domain request lists, and batching state. Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
This provides the infrastructure for schedulers to expose their internal state through debugfs. We add a list of queue attributes and a list of hctx attributes to struct elevator_type and wire them up when switching schedulers. Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Add missing seq_file.h header in blk-mq-debugfs.h Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Originally, I tied debugfs registration/unregistration together with sysfs. There's no reason to do this, and it's getting in the way of letting schedulers define their own debugfs attributes. Instead, tie the debugfs registration to the lifetime of the structures themselves. The saner lifetimes mean we can also get rid of the extra mq directory and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now just nvme0n1/hctx0/tags. Signed-off-by:
Omar Sandoval <osandov@fb.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Preparation for adding more declarations. Signed-off-by:
Omar Sandoval <osandov@fb.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
In commit e869b546 ("blk-mq: Unregister debugfs attributes earlier"), we shuffled the debugfs cleanup around so that the "state" attribute was removed before we freed the blk-mq data structures. However, later changes are going to undo that, so we need to explicitly disallow running a dead queue.
[Omar: rebased and updated commit message]
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
A large part of blk-mq-debugfs.c is file_operations and seq_file boilerplate. This sucks as is but will suck even more when schedulers can define their own debugfs entries. Factor it all out into a single blk_mq_debugfs_fops which multiplexes as needed. We store the request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory dentry, which is kind of hacky, but it works.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
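The multiplexing idea in this commit can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel code: every name in it (demo_attr, demo_show, demo_queue, demo_hctx) is hypothetical, and the parent object is passed in as a plain pointer rather than being looked up from a parent directory dentry. It only shows how one generic entry point plus a per-object attribute table can replace a separate set of callbacks per file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the objects whose state is exposed. */
struct demo_queue { unsigned long state; };
struct demo_hctx { unsigned int nr_tags; };

/* One table entry per file: a name plus a show callback. */
struct demo_attr {
	const char *name;
	int (*show)(void *data, char *out, size_t outlen);
};

static int demo_queue_state_show(void *data, char *out, size_t outlen)
{
	const struct demo_queue *q = data;

	return snprintf(out, outlen, "0x%lx\n", q->state);
}

static int demo_hctx_tags_show(void *data, char *out, size_t outlen)
{
	const struct demo_hctx *hctx = data;

	return snprintf(out, outlen, "nr_tags=%u\n", hctx->nr_tags);
}

/* Attribute tables, terminated by an entry with a NULL name. */
static const struct demo_attr demo_queue_attrs[] = {
	{ "state", demo_queue_state_show },
	{ NULL, NULL },
};

static const struct demo_attr demo_hctx_attrs[] = {
	{ "tags", demo_hctx_tags_show },
	{ NULL, NULL },
};

/*
 * Single generic entry point: find the attribute by name and call its
 * callback with the parent object. Returns the callback's result, or -1
 * if no attribute with that name exists.
 */
static int demo_show(const struct demo_attr *attrs, const char *name,
		     void *parent_data, char *out, size_t outlen)
{
	assert(attrs != NULL && name != NULL && out != NULL);

	for (; attrs->name; attrs++) {
		if (strcmp(attrs->name, name) == 0)
			return attrs->show(parent_data, out, outlen);
	}
	return -1;
}
```

With this layout, adding a new file means adding one table entry and one callback; the dispatch code stays untouched, which is the property the commit is after for scheduler-defined entries.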
-
Omar Sandoval authored
It's not clear what these numbered directories represent unless you consult the code. We're about to get rid of the intermediate "mq" directory, so these would be even more confusing without that context.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
Slightly more readable, plus we also strip leading spaces.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Omar Sandoval authored
blk_queue_flags_store() currently truncates and returns a short write if the operation being written is too long. This can give us weird results, like here:

$ echo "run bar"
echo: write error: invalid argument
$ dmesg
[ 1103.075435] blk_queue_flags_store: unsupported operation bar. Use either 'run' or 'start'

Instead, return an error if the user does this. While we're here, make the argument names consistent with everywhere else in this file.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
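The difference between truncating and rejecting an overlong write can be sketched in user space. This is an illustration under stated assumptions, not the kernel function: state_write_sketch, OP_BUF_SIZE and the whitespace trimming are hypothetical, and the only details taken from the commit are the accepted operations 'run' and 'start' and the decision to return an error instead of a short write.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define OP_BUF_SIZE 16	/* hypothetical size of the operation buffer */

/*
 * Hypothetical stand-in for a write handler. Returns count when the
 * operation is accepted and -EINVAL otherwise, so the caller never sees
 * a short write.
 */
static ssize_t state_write_sketch(const char *ubuf, size_t count)
{
	char opbuf[OP_BUF_SIZE];
	char *op = opbuf;
	size_t len;

	assert(ubuf != NULL);

	/* Reject input that does not fit instead of silently truncating it. */
	if (count >= sizeof(opbuf))
		return -EINVAL;

	memcpy(opbuf, ubuf, count);
	opbuf[count] = '\0';

	/* Trim trailing newlines and spaces, then skip leading spaces. */
	len = strlen(opbuf);
	while (len > 0 && (opbuf[len - 1] == '\n' || opbuf[len - 1] == ' '))
		opbuf[--len] = '\0';
	while (*op == ' ')
		op++;

	if (strcmp(op, "run") == 0 || strcmp(op, "start") == 0)
		return (ssize_t)count;

	return -EINVAL;
}
```

Returning the full count or an error keeps the result unambiguous for the writer: a short write would invite the caller to retry with the leftover bytes, which then get parsed as a separate, meaningless operation.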
-
Omar Sandoval authored
Make sure the spelled out flag names match the definition. This also adds a missing hctx state, BLK_MQ_S_START_ON_RUN, and a missing cmd_flag, __REQ_NOUNMAP.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
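One common way to keep printed flag names from drifting away from their definitions is to generate each name from the constant itself with the preprocessor. The sketch below shows that technique in user space; it is an assumption about how such a table could be built, not a copy of the kernel source. The DEMO_S_* constants, their bit numbers, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bit numbers; bits 3 and 4 are deliberately unused. */
enum {
	DEMO_S_STOPPED = 0,
	DEMO_S_TAG_ACTIVE = 1,
	DEMO_S_SCHED_RESTART = 2,
	DEMO_S_START_ON_RUN = 5,
};

/* The string is derived from the constant's name, so the two cannot diverge. */
#define DEMO_STATE_NAME(name) [DEMO_S_##name] = #name

static const char *const demo_state_name[] = {
	DEMO_STATE_NAME(STOPPED),
	DEMO_STATE_NAME(TAG_ACTIVE),
	DEMO_STATE_NAME(SCHED_RESTART),
	DEMO_STATE_NAME(START_ON_RUN),
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Format the set bits of flags into out, joined by sep. Bits without a
 * name are printed as their bit number. Returns the number of characters
 * written, or -1 if out is too small.
 */
static int demo_flags_format(unsigned long flags, const char *sep,
			     char *out, size_t outlen)
{
	size_t used = 0;
	size_t bit;
	int first = 1;

	assert(sep != NULL && out != NULL && outlen > 0);
	out[0] = '\0';

	for (bit = 0; bit < sizeof(flags) * CHAR_BIT; bit++) {
		int n;

		if (!(flags & (1UL << bit)))
			continue;
		if (bit < DEMO_ARRAY_SIZE(demo_state_name) && demo_state_name[bit])
			n = snprintf(out + used, outlen - used, "%s%s",
				     first ? "" : sep, demo_state_name[bit]);
		else
			n = snprintf(out + used, outlen - used, "%s%zu",
				     first ? "" : sep, bit);
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
		first = 0;
	}
	return (int)used;
}
```

Because the table uses designated initializers, a flag that is defined but missing from the table shows up as a bare bit number in the output, which makes the omission easy to spot.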
-
Omar Sandoval authored
This reads more naturally than spaces.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Peter Zijlstra authored
By poking at /debug/sched_features I triggered the following splat:

[] ======================================================
[] WARNING: possible circular locking dependency detected
[] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
[] ------------------------------------------------------
[] bash/2109 is trying to acquire lock:
[]  (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
[]
[] but task is already holding lock:
[]  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
[]
[] which lock already depends on the new lock.
[]
[]
[] the existing dependency chain (in reverse order) is:
[]
[] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
[]   lock_acquire+0x100/0x210
[]   down_write+0x28/0x60
[]   start_creating+0x5e/0xf0
[]   debugfs_create_dir+0x13/0x110
[]   blk_mq_debugfs_register+0x21/0x70
[]   blk_mq_register_dev+0x64/0xd0
[]   blk_register_queue+0x6a/0x170
[]   device_add_disk+0x22d/0x440
[]   loop_add+0x1f3/0x280
[]   loop_init+0x104/0x142
[]   do_one_initcall+0x43/0x180
[]   kernel_init_freeable+0x1de/0x266
[]   kernel_init+0xe/0x100
[]   ret_from_fork+0x31/0x40
[]
[] -> #1 (all_q_mutex){+.+.+.}:
[]   lock_acquire+0x100/0x210
[]   __mutex_lock+0x6c/0x960
[]   mutex_lock_nested+0x1b/0x20
[]   blk_mq_init_allocated_queue+0x37c/0x4e0
[]   blk_mq_init_queue+0x3a/0x60
[]   loop_add+0xe5/0x280
[]   loop_init+0x104/0x142
[]   do_one_initcall+0x43/0x180
[]   kernel_init_freeable+0x1de/0x266
[]   kernel_init+0xe/0x100
[]   ret_from_fork+0x31/0x40
[] *** DEADLOCK ***
[]
[] 3 locks held by bash/2109:
[]  #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
[]  #1: (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
[]  #2: (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
[]
[] stack backtrace:
[] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
[] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[] Call Trace:
[]   lock_acquire+0x100/0x210
[]   get_online_cpus+0x2a/0x90
[]   static_key_slow_dec+0x1b/0x50
[]   static_key_disable+0x20/0x30
[]   sched_feat_write+0x131/0x170
[]   full_proxy_write+0x97/0xd0
[]   __vfs_write+0x28/0x120
[]   vfs_write+0xb5/0x1a0
[]   SyS_write+0x49/0xa0
[]   entry_SYSCALL_64_fastpath+0x23/0xc2

This is because of the cpu hotplug lock rework. Break the chain at #1 by reversing the lock acquisition order. This way i_mutex_key#4 no longer depends on cpu_hotplug_lock and things are good.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- May 03, 2017
-
-
Jens Axboe authored
A previous commit introduced the sync flush, which we need from internal callers like blk_mq_quiesce_queue(). However, we also call the stop helpers from drivers, particularly from ->queue_rq() when we have to stop processing for a bit. We can't block from those locations, and we don't have to guarantee that we're fully flushed.
Fixes: 9f993737 ("blk-mq: unify hctx delayed_run_work and run_work")
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-