- Jul 15, 2021
Pavel Skripkin authored
In commit bc9e9e67 ("KVM: debugfs: Reuse binary stats descriptors") the loop that fills debugfs_stat_data was copy-pasted twice, but the second loop stores its pointers over the ones allocated in the first loop. The net effect is a memory leak; fix it.

Fixes: bc9e9e67 ("KVM: debugfs: Reuse binary stats descriptors")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Reviewed-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210701195500.27097-1-paskripkin@gmail.com>
Reviewed-by: Jing Zhang <jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
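An illustrative sketch of the bug shape (simplified; the allocation of each stat_data entry is elided): the second loop has to index past the VM entries instead of restarting at zero.

    for (i = 0; i < kvm_vm_stats_header.num_desc; ++i) {
            /* ... stat_data allocated for VM descriptor i ... */
            kvm->debugfs_stat_data[i] = stat_data;
    }
    for (i = 0; i < kvm_vcpu_stats_header.num_desc; ++i) {
            /* buggy copy-paste: kvm->debugfs_stat_data[i] = stat_data;
             * overwrites (and leaks) the VM entries stored above.  Fixed: */
            kvm->debugfs_stat_data[i + kvm_vm_stats_header.num_desc] = stat_data;
    }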
-
- Jul 14, 2021
Kefeng Wang authored
BUG: KASAN: use-after-free in kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
Read of size 8 at addr ffff0000c03a2500 by task syz-executor083/4269

CPU: 5 PID: 4269 Comm: syz-executor083 Not tainted 5.10.0 #7
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
 show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x110/0x164 lib/dump_stack.c:118
 print_address_description+0x78/0x5c8 mm/kasan/report.c:385
 __kasan_report mm/kasan/report.c:545 [inline]
 kasan_report+0x148/0x1e4 mm/kasan/report.c:562
 check_memory_region_inline mm/kasan/generic.c:183 [inline]
 __asan_load8+0xb4/0xbc mm/kasan/generic.c:252
 kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
 kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
 el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
 do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

Allocated by task 4269:
 stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
 kasan_save_stack mm/kasan/common.c:48 [inline]
 kasan_set_track mm/kasan/common.c:56 [inline]
 __kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
 kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
 kmem_cache_alloc_trace include/linux/slab.h:450 [inline]
 kmalloc include/linux/slab.h:552 [inline]
 kzalloc include/linux/slab.h:664 [inline]
 kvm_vm_ioctl_register_coalesced_mmio+0x78/0x1cc arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:146
 kvm_vm_ioctl+0x7e8/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3746
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
 el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
 do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

Freed by task 4269:
 stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
 kasan_save_stack mm/kasan/common.c:48 [inline]
 kasan_set_track+0x38/0x6c mm/kasan/common.c:56
 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
 __kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
 kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
 slab_free_hook mm/slub.c:1544 [inline]
 slab_free_freelist_hook mm/slub.c:1577 [inline]
 slab_free mm/slub.c:3142 [inline]
 kfree+0x104/0x38c mm/slub.c:4124
 coalesced_mmio_destructor+0x94/0xa4 arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:102
 kvm_iodevice_destructor include/kvm/iodev.h:61 [inline]
 kvm_io_bus_unregister_dev+0x248/0x280 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:4374
 kvm_vm_ioctl_unregister_coalesced_mmio+0x158/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:186
 kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
 el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
 do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
 el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
 el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
 el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

If kvm_io_bus_unregister_dev() returns -ENOMEM, kvm_iodevice_destructor() has already been called inside that function to delete 'struct kvm_coalesced_mmio_dev *dev' from the list and free the dev; calling kvm_iodevice_destructor() again leads to the use-after-free above. Check the return value of kvm_io_bus_unregister_dev() and only call kvm_iodevice_destructor() if it returns 0.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Message-Id: <20210626070304.143456-1-wangkefeng.wang@huawei.com>
Cc: stable@vger.kernel.org
Fixes: 5d3c4c79 ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed", 2021-04-20)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
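A sketch of the resulting unregister path (condensed from coalesced_mmio.c; assumes kvm_io_bus_unregister_dev() reports failure to the caller, per this fix):

    list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
            if (coalesced_mmio_in_range(dev, zone->addr, zone->size)) {
                    r = kvm_io_bus_unregister_dev(kvm,
                            zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, &dev->dev);
                    /*
                     * On failure the bus was destroyed and the destructor
                     * already freed every device, including this one;
                     * destroying it again is the use-after-free fixed here.
                     */
                    if (r)
                            break;
                    kvm_iodevice_destructor(&dev->dev);
            }
    }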
-
- Jun 29, 2021
Liam Howlett authored
vma_lookup() finds the vma of a specific address with a cleaner interface and is more readable.

Link: https://lkml.kernel.org/r/20210521174745.2219620-11-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
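For illustration, the usual caller-side simplification (vma_lookup() returns NULL unless the address actually falls inside a vma, whereas find_vma() may return the next vma above it):

    /* before */
    vma = find_vma(mm, addr);
    if (!vma || addr < vma->vm_start)
            return -EFAULT;

    /* after */
    vma = vma_lookup(mm, addr);
    if (!vma)
            return -EFAULT;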
-
- Jun 24, 2021
Jing Zhang authored
To remove code duplication, use the binary stats descriptors in the implementation of the debugfs interface for statistics. This unifies the definition of statistics for the binary and debugfs interfaces.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jing Zhang authored
Add a VCPU ioctl that returns a statistics file descriptor; reading from it lets userspace retrieve the VCPU stats header, descriptors and data. Define VCPU statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
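A hedged userspace sketch of consuming the new fd (vcpu_fd is assumed to be an open vCPU fd; field names as merged in include/uapi/linux/kvm.h; error handling elided):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/kvm.h>

    int stats_fd = ioctl(vcpu_fd, KVM_GET_STATS_FD, NULL);

    struct kvm_stats_header hdr;
    pread(stats_fd, &hdr, sizeof(hdr), 0);

    /* hdr.num_desc descriptors of (sizeof(struct kvm_stats_desc) +
     * hdr.name_size) bytes each live at hdr.desc_offset; the stats
     * values live at hdr.data_offset and can be re-read cheaply.
     * Simplifying assumption below: every stat is a single u64. */
    uint64_t data[hdr.num_desc];
    pread(stats_fd, data, sizeof(data), hdr.data_offset);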
-
Jing Zhang authored
Add a VM ioctl that returns a statistics file descriptor; reading from it lets userspace retrieve the VM stats header, descriptors and data. Define VM statistics descriptors and header for all architectures.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Nicholas Piggin authored
It's possible to create a region which maps valid but non-refcounted pages (e.g., tail pages of non-compound higher order allocations). These host pages can then be returned by the gfn_to_page/gfn_to_pfn family of APIs, which take a reference to the page and thereby raise its refcount from 0 to 1. When that reference is dropped, the page is freed incorrectly. Fix this by taking a reference on a valid page only if its refcount was already non-zero, which indicates the page participates in normal refcounting (and can be released with put_page). This addresses CVE-2021-22543.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
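A minimal sketch of the guarded reference (the helper name is illustrative; the key primitive is get_page_unless_zero()):

    /* Only pin pages that participate in normal refcounting. */
    static struct page *kvm_try_get_page(struct page *page)
    {
            if (get_page_unless_zero(page))   /* fails if refcount is 0 */
                    return page;              /* a later put_page() balances */
            return NULL;  /* e.g. tail page of a non-compound allocation */
    }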
-
Jing Zhang authored
This commit defines the API for userspace and prepares the common functionality needed to support per-VM/VCPU binary stats data readings. KVM stats are currently only accessible through debugfs, which has some shortcomings this series is meant to fix:
1. The debugfs stats in KVM can be disabled when kernel Lockdown mode is enabled, which is a potential risk for production.
2. The debugfs stats in KVM are organized as "one stat per file", which is good for debugging but not efficient for production.
3. The stats read/clear in the debugfs solution are protected by the global kvm_lock.
Besides that, there are some other benefits with this change:
1. All KVM VM/VCPU stats can be read out in bulk with one copy to userspace.
2. A schema is used to describe the KVM statistics; from userspace's perspective, the statistics are self-describing.
3. With the fd-based solution, a separate telemetry agent can read KVM stats in a less privileged environment.
4. After the initial setup of reading in the stats descriptors, a telemetry agent only needs to read the stats data itself; no more parsing or setup is needed.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com> #arm64
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-3-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Jing Zhang authored
Generic KVM stats are those collected in architecture independent code or those supported by all architectures; put all generic statistics in a separate structure. This ensures that they are defined the same way in the statistics API which is being added, removing duplication among different architectures in the declaration of the descriptors. No functional change intended.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
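Roughly the resulting layout (a sketch; the exact field list lives in include/linux/kvm_types.h and each arch's kvm_host.h):

    struct kvm_vm_stat_generic {
            u64 remote_tlb_flush;
    };

    struct kvm_vcpu_stat_generic {
            u64 halt_successful_poll;
            u64 halt_attempted_poll;
            u64 halt_poll_invalid;
            u64 halt_wakeup;
            u64 halt_poll_success_ns;
            u64 halt_poll_fail_ns;
    };

    /* each architecture embeds the generic block first */
    struct kvm_vm_stat {
            struct kvm_vm_stat_generic generic;
            /* arch-specific counters follow */
    };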
-
- Jun 18, 2021
Peter Zijlstra authored
When run from the sched-out path (preempt_notifier or perf_event), p->state is irrelevant for determining preemption: you can get preempted with !task_is_running() just fine. The right indicator for preemption is whether the task is still on the runqueue in the sched-out path.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210611082838.285099381@infradead.org
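A sketch of the corrected sched-out hook in kvm_main.c (condensed; vcpu->ready handling omitted):

    static void kvm_sched_out(struct preempt_notifier *pn,
                              struct task_struct *next)
    {
            struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

            /* still queued at sched-out => preempted, whatever p->state says */
            if (current->on_rq)
                    WRITE_ONCE(vcpu->preempted, true);
            kvm_arch_vcpu_put(vcpu);
    }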
-
- Jun 17, 2021
Paolo Bonzini authored
Make them the same type as vCPU stats. There is no reason to limit the counters to unsigned long.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sergey Senozhatsky authored
Add KVM PM-notifier so that architectures can have arch-specific VM suspend/resume routines. Such architectures need to select CONFIG_HAVE_KVM_PM_NOTIFIER and implement kvm_arch_pm_notifier().

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20210606021045.14159-1-senozhatsky@chromium.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
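A sketch of the plumbing (close to the patch; the high priority is meant to notify KVM before devices are suspended):

    #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
    static int kvm_pm_notifier_call(struct notifier_block *bl,
                                    unsigned long state, void *unused)
    {
            struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);

            /* forwards PM_HIBERNATION_PREPARE, PM_POST_SUSPEND, ... */
            return kvm_arch_pm_notifier(kvm, state);
    }

    static void kvm_init_pm_notifier(struct kvm *kvm)
    {
            kvm->pm_notifier.notifier_call = kvm_pm_notifier_call;
            kvm->pm_notifier.priority = INT_MAX;
            register_pm_notifier(&kvm->pm_notifier);
    }
    #endif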
-
Ben Gardon authored
Add a new lock to protect the arch-specific fields of memslots if they need to be modified in a kvm->srcu read critical section. A future commit will use this lock to lazily allocate memslot rmaps for x86.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-5-bgardon@google.com>
[Add Documentation/ hunk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
Factor out copying kvm_memslots from allocating the memory for new ones in preparation for adding a new lock to protect the arch-specific fields of the memslots. No functional change intended.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-4-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- May 27, 2021
Marcelo Tosatti authored
For VMX, when a vcpu enters HLT emulation, pi_post_block will:

1) Add vcpu to per-cpu list of blocked vcpus.
2) Program the posted-interrupt descriptor "notification vector" to POSTED_INTR_WAKEUP_VECTOR.

With interrupt remapping, an interrupt will set the PIR bit for the vector programmed for the device on the CPU, test-and-set the ON bit on the posted interrupt descriptor, and if the ON bit is clear generate an interrupt for the notification vector. This way, the target CPU wakes upon a device interrupt and wakes up the target vcpu.

The problem is that pi_post_block only programs the notification vector if kvm_arch_has_assigned_device() is true. It's possible for the following to happen:

1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false, notification vector is not programmed
2) device is assigned to VM
3) device interrupts vcpu V, sets ON bit (notification vector not programmed, so pcpu P remains in idle)
4) vcpu 0 IPIs vcpu V (in guest), but since the pi descriptor ON bit is set, kvm_vcpu_kick is skipped
5) vcpu 0 busy spins on vcpu V's response for several seconds, until the RCU watchdog NMIs all vCPUs.

To fix this, use the start_assignment kvm_x86_ops callback to kick vcpus out of the halt loop, so the notification vector is properly reprogrammed to the wakeup vector.

Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210526172014.GA29007@fuller.cnet>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
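A sketch of the VMX callback (assuming the hook wiring shown in the patch):

    /* Kick all vCPUs out of halt emulation when the first device is
     * assigned, so pi_post_block() reruns with
     * kvm_arch_has_assigned_device() == true and programs the wakeup
     * vector. */
    static void vmx_pi_start_assignment(struct kvm *kvm)
    {
            if (!irq_remapping_cap(IRQ_POSTING_CAP))
                    return;

            kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);
    }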
-
Marcelo Tosatti authored
KVM_REQ_UNBLOCK will be used to exit a vcpu from its inner vcpu halt emulation loop. Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, and switch PowerPC to an arch-specific request bit.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210525134321.303768132@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
This is inspired by commit 262de410 ("kvm: exit halt polling on need_resched() as well"). Because PPC implements arch-specific halt polling logic, we need the need_resched() check there as well. This patch adds a helper function that can be shared between book3s and the generic halt-polling loops.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Venkatesh Srinivas <venkateshs@chromium.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Venkatesh Srinivas <venkateshs@chromium.org>
Cc: Jim Mattson <jmattson@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1621339235-11131-1-git-send-email-wanpengli@tencent.com>
[Make the function inline. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
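The shared helper is roughly:

    static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
    {
            return single_task_running() && !need_resched() &&
                   ktime_before(cur, stop);
    }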
-
- May 15, 2021
Zhu Lingshan authored
This reverts commit a979a6aa. The reverted commit may cause VM freeze on arm64 with GICv4, where stopping a consumer is implemented by suspending the VM. Should the connect fail, the VM will not be resumed, which is a bit of a problem. It also erroneously calls the producer destructor unconditionally, which is unexpected.

Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
[maz: tags and cc-stable, commit message update]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Fixes: a979a6aa ("irqbypass: do not start cons/prod when failed connect")
Link: https://lore.kernel.org/r/3a2c66d6-6ca0-8478-d24b-61e8e3241b20@hisilicon.com
Link: https://lore.kernel.org/r/20210508071152.722425-1-lingshan.zhu@intel.com
Cc: stable@vger.kernel.org
-
- May 07, 2021
David Matlack authored
When growing halt-polling, there is no check that the poll time exceeds the per-VM limit. It's possible for vcpu->halt_poll_ns to grow past kvm->max_halt_poll_ns and stay there until a halt which takes longer than kvm->halt_poll_ns.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
Message-Id: <20210506152442.4010298-1-venkateshs@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
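A sketch of the fixed grow path (condensed from grow_halt_poll_ns(); tracepoint and module-parameter plumbing elided):

    val = vcpu->halt_poll_ns * halt_poll_ns_grow;
    if (val < halt_poll_ns_grow_start)
            val = halt_poll_ns_grow_start;

    /* the fix: never grow past the per-VM cap */
    if (val > vcpu->kvm->max_halt_poll_ns)
            val = vcpu->kvm->max_halt_poll_ns;

    vcpu->halt_poll_ns = val;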
-
- May 03, 2021
Benjamin Segall authored
single_task_running() is usually more general than need_resched(), but CFS_BANDWIDTH throttling will use resched_task() when there is just one task, to get the task to block. This was causing long need_resched warnings and was likely allowing VMs to overrun their quota when halt polling.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
Message-Id: <20210429162233.116849-1-venkateshs@chromium.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
-
- Apr 21, 2021
Wanpeng Li authored
Both the lock holder vCPU and an IPI receiver that has halted are candidates for boost. However, the PLE handler was originally designed to deal with the lock holder preemption problem: Intel PLE occurs when the spinlock waiter is in kernel mode. This assumption doesn't hold for IPI receivers, which can be in either kernel or user mode. A vCPU candidate in user mode will not be boosted even if it should respond to IPIs. Some benchmarks like pbzip2 and swaptions do the TLB shootdown in kernel mode but run in user mode most of the time. This can lead to a large number of continuous PLE events, because the IPI sender causes PLE events repeatedly until the receiver is scheduled, while the receiver is not a candidate for a boost.

This patch boosts a vCPU candidate in user mode to which an interrupt is being delivered. We can observe that the speed of pbzip2 improves 10% in a 96-vCPU VM in an over-subscribe scenario (the host machine is a 2-socket, 48-core, 96-HT Intel CLX box). There is no performance regression for other benchmarks like Unixbench spawn (most of the time contending a read/write lock in kernel mode) or ebizzy (most of the time contending a read/write sem and TLB shootdown in kernel mode).

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
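A sketch of the candidate filter in kvm_vcpu_on_spin() after this change (kvm_arch_dy_has_pending_interrupt() is the per-arch hook the patch adds):

    /* Skip a preempted vCPU only if the caller wants a kernel-mode lock
     * holder AND the vCPU has no interrupt to deliver; a user-mode vCPU
     * that is an IPI target stays eligible for the boost. */
    if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
        !kvm_arch_dy_has_pending_interrupt(vcpu) &&
        !kvm_arch_vcpu_in_kernel(vcpu))
            continue;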
-
Nathan Tempelman authored
Add a capability for userspace to mirror SEV encryption context from one vm to another. On our side, this is intended to support a Migration Helper vCPU, but it can also be used generically to support other in-guest workloads scheduled by the host. The intention is for the primary guest and the mirror to have nearly identical memslots.

The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy migration (the migration vCPUs on the target need to read and write to pages, when the primary guest would VMEXIT).

This does not change the threat model for AMD SEV. Any memory involved is still owned by the primary guest and its initial state is still attested to through the normal SEV_LAUNCH_* flows. If userspace wanted to circumvent SEV, they could achieve the same effect by simply attaching a vCPU to the primary VM. This patch deliberately leaves userspace in charge of the memslots for the mirror, as it already has the power to mess with them in the primary guest.

This patch does not support SEV-ES (much less SNP), as it does not handle handing off attested VMSAs to the mirror. For additional context, we need a Migration Helper because SEV PSP migration is far too slow for our live migration on its own. Using an in-guest migrator lets us speed this up significantly.

Signed-off-by: Nathan Tempelman <natet@google.com>
Message-Id: <20210408223214.2582277-1-natet@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
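A hedged userspace sketch of enabling the capability on the mirror VM (primary_vm_fd and mirror_vm_fd are assumed to be open VM fds):

    struct kvm_enable_cap cap = {
            .cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
            .args[0] = primary_vm_fd,   /* source of the SEV context */
    };
    if (ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap) < 0)
            perror("KVM_ENABLE_CAP(VM_COPY_ENC_CONTEXT_FROM)");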
-
- Apr 20, 2021
Sean Christopherson authored
Convert a comment above kvm_io_bus_unregister_dev() into an actual lockdep assertion, and opportunistically add curly braces to a multi-line for-loop.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210412222050.876100-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
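The conversion itself is a one-liner at the top of kvm_io_bus_unregister_dev():

    /* replaces the "must be called with slots_lock held" comment */
    lockdep_assert_held(&kvm->slots_lock);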
-
Sean Christopherson authored
Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev() fails to allocate memory for the new instance of the bus. If it can't instantiate a new bus, unregister_dev() destroys all devices _except_ the target device. But, it doesn't tell the caller that it obliterated the bus and invoked the destructor for all devices that were on the bus. In the coalesced MMIO case, this can result in a deleted list entry dereference due to attempting to continue iterating on coalesced_zones after future entries (in the walk) have been deleted. Opportunistically add curly braces to the for-loop, which encompasses many lines but sneaks by without braces due to the guts being a single if statement.

Fixes: f6588660 ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: stable@vger.kernel.org
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210412222050.876100-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
If allocating a new instance of an I/O bus fails when unregistering a device, wait to destroy the device until after all readers are guaranteed to see the new null bus. Destroying devices before the bus is nullified could lead to use-after-free since readers expect the devices on their reference of the bus to remain valid.

Fixes: f6588660 ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210412222050.876100-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
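A sketch of the corrected ordering (condensed; i is the index of the target device, which the caller still owns and destroys itself):

    rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
    synchronize_srcu_expedited(&kvm->srcu);   /* all readers see new_bus */

    /* only now is it safe to destroy the old bus's devices */
    if (!new_bus) {
            for (j = 0; j < bus->dev_count; j++) {
                    if (j == i)   /* skip the target device */
                            continue;
                    kvm_iodevice_destructor(bus->range[j].dev);
            }
    }
    kfree(bus);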
-
- Apr 17, 2021
Sean Christopherson authored
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been detected in the memslots, i.e. don't take the lock for notifications that don't affect the guest.

For small VMs, spurious locking is a minor annoyance. And for "volatile" setups where the majority of notifications _are_ relevant, this barely qualifies as an optimization. But, for large VMs (hundreds of threads) with static setups, e.g. no page migration, no swapping, etc..., the vast majority of MMU notifier callbacks will be unrelated to the guest, e.g. will often be in response to the userspace VMM adjusting its own virtual address space. In such large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from handling page faults. In some scenarios it can even be "fatal" in the sense that it causes unacceptable brownouts, e.g. when rebuilding huge pages after live migration, a significant percentage of vCPUs will be attempting to handle page faults.

x86's TDP MMU implementation is especially susceptible to spurious locking due to it taking mmu_lock for read when handling page faults. Because rwlock is fair, a single writer will stall future readers, while the writer is itself stalled waiting for in-progress readers to complete. This is exacerbated by the MMU notifiers often firing multiple times in quick succession, e.g. moving a page will (always?) invoke three separate notifiers: .invalidate_range_start(), .invalidate_range_end(), and .change_pte(). Unnecessarily taking mmu_lock each time means even a single spurious sequence can be problematic.

Note, this optimizes only the unpaired callbacks. Optimizing the .invalidate_range_{start,end}() pairs is more complex and will be done in a future patch.

Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Acquire and release mmu_lock in the __kvm_handle_hva_range() helper instead of requiring the caller to do the same. This paves the way for future patches to take mmu_lock if and only if an overlapping memslot is found, without also having to introduce the on_lock() shenanigans used to manipulate the notifier count and sequence. No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Yank out the hva-based MMU notifier APIs now that all architectures that use the notifiers have moved to the gfn-based APIs. No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the hva->gfn lookup for MMU notifiers into common code. Every arch does a similar lookup, and some arch code is all but identical across multiple architectures. In addition to consolidating code, this will allow introducing optimizations that will benefit all architectures without incurring multiple walks of the memslots, e.g. by taking mmu_lock if and only if a relevant range exists in the memslots. The use of __always_inline to avoid indirect call retpolines, as done by x86, may also benefit other architectures. Consolidating the lookups also fixes a wart in x86, where the legacy MMU and TDP MMU each do their own memslot walks. Lastly, future enhancements to the memslot implementation, e.g. to add an interval tree to track host address, will need to touch far less arch specific code. MIPS, PPC, and arm64 will be converted one at a time in future patches.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
In KVM's .change_pte() notification callback, replace the notifier sequence bump with a WARN_ON assertion that the notifier count is elevated. An elevated count provides stricter protections than bumping the sequence, and the sequence is guaranteed to be bumped before the count hits zero.

When .change_pte() was added by commit 828502d3 ("ksm: add mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary as .change_pte() would be invoked without any surrounding notifications. However, since commit 6bdb913f ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end"), all calls to .change_pte() are guaranteed to be surrounded by start() and end(), and so are guaranteed to run with an elevated notifier count.

Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats the purpose of .change_pte(). Every arch's kvm_set_spte_hva() assumes .change_pte() is called when the relevant SPTE is present in KVM's MMU, as the original goal was to accelerate Kernel Samepage Merging (KSM) by updating KVM's SPTEs without requiring a VM-Exit (due to invalidating the SPTE). I.e. it means that .change_pte() is effectively dead code on _all_ architectures. x86 and MIPS are clearcut nops if the old SPTE is not-present, and that is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE, which again should be a nop due to the invalidation. arm64 is a bit murky, but it's also likely a nop because kvm_pgtable_stage2_map() is called without a cache pointer, which means it will map an entry if and only if an existing PTE was found.

For now, take advantage of the bug to simplify future consolidation of KVM's MMU notifier code. Doing so will not greatly complicate fixing .change_pte(), assuming it's even worth fixing. .change_pte() has been broken for 8+ years and no one has complained. Even if there are KSM+KVM users that care deeply about its performance, the benefits of avoiding VM-Exits via .change_pte() need to be reevaluated to justify the added complexity and testing burden. Ripping out .change_pte() entirely would be a lot easier.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious that the allocations are accounted, to make it easier to audit KVM's allocations in the future, and to be consistent with other cache usage in KVM.

When using SLAB/SLUB, this is a nop as the cache itself is created with SLAB_ACCOUNT. When using SLOB, there are caveats within caveats. SLOB doesn't honor SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU allocations now being accounted. But even that depends on internal SLOB details, as SLOB will only go to the page allocator when its cache is depleted. That just happens to be extremely likely for vCPUs, because the size of kvm_vcpu is larger than a page for almost all combinations of architecture and page size. Whether or not the SLOB behavior is by design is unknown; it's just as likely that no SLOB users care about accounting and so no one has bothered to implement support in SLOB. Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup users, if any exist.

Reviewed-by: Wanpeng Li <kernellwp@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406190740.4055679-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
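The change itself is a one-liner in kvm_vm_ioctl_create_vcpu():

    /* was: kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL); */
    vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);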
-
Sean Christopherson authored
Move arm64's MMU notifier trace events into common code in preparation for doing the hva->gfn lookup in common code. The alternative would be to trace the gfn instead of hva, but that's not obviously better and could also be done in common code. Tracing the notifiers is also quite handy for debug regardless of architecture. Remove a completely redundant tracepoint from PPC e500.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- Feb 22, 2021
David Stevens authored
Track the range being invalidated by mmu_notifier and skip page fault retries if the fault address is not affected by the in-progress invalidation. Handle concurrent invalidations by finding the minimal range which includes all ranges being invalidated. Although the combined range may include unrelated addresses and cannot be shrunk as individual invalidation operations complete, it is unlikely the marginal gains of proper range tracking are worth the additional complexity. The primary benefit of this change is the reduction in the likelihood of extreme latency when handling a page fault due to another thread having been preempted while modifying host virtual addresses.

Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20210222024522.1751719-3-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
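A sketch of the retry check (fields maintained under mmu_lock by the invalidate_range notifiers; close in shape to the mmu_notifier_retry_hva() helper this series introduces):

    /* in the page fault path, once a stable hva is known */
    if (unlikely(kvm->mmu_notifier_count)) {
            if (hva >= kvm->mmu_notifier_range_start &&
                hva < kvm->mmu_notifier_range_end)
                    return true;    /* overlaps an in-flight invalidation: retry */
    }
    if (kvm->mmu_notifier_seq != mmu_seq)
            return true;            /* an invalidation completed meanwhile */
    return false;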
-
- Feb 09, 2021
Sean Christopherson authored
Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving a so-called "remapped" hva/pfn pair. In theory, the hva could resolve to a pfn in high memory on a 32-bit kernel. This bug was inadvertently exposed by commit bd2fae8d ("KVM: do not assume PTE is writable after follow_pfn"), which added an error PFN value to the mix, causing gcc to complain about overflowing the unsigned long:

arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’:
include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’ to ‘long unsigned int’ changes value from ‘9218868437227405314’ to ‘2’ [-Werror=overflow]
   89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
      |                              ^
virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’

Cc: stable@vger.kernel.org
Fixes: add6a0cd ("KVM: MMU: try to fix up page faults before giving up")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210208201940.1258328-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Currently, the follow_pfn function is exported for modules but follow_pte is not. However, follow_pfn is very easy to misuse, because it does not provide protections (so most of its callers assume the page is writable!) and because it returns after having already unlocked the page table lock. Provide instead a simplified version of follow_pte that does not have the pmdpp and range arguments. The older version survives as follow_invalidate_pte() for use by fs/dax.c.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- Feb 04, 2021
Ben Gardon authored
Add a read/write lock to be used in place of the MMU spinlock on x86. The rwlock will enable the TDP MMU to handle page faults, and other operations, in parallel in future commits.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-19-bgardon@google.com>
[Introduce virt/kvm/mmu_lock.h - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Tian Tao authored
Fix the following warning:

/virt/kvm/dirty_ring.c:70:20-27: WARNING: vzalloc should be used for ring->dirty_gfns, instead of vmalloc/memset.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Message-Id: <1611547045-13669-1-git-send-email-tiantao6@hisilicon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
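The fix is mechanical:

    /* was: vmalloc(size) followed by memset(ring->dirty_gfns, 0, size) */
    ring->dirty_gfns = vzalloc(size);
    if (!ring->dirty_gfns)
            return -ENOMEM;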
-
Paolo Bonzini authored
In order to convert an HVA to a PFN, KVM usually tries to use the get_user_pages family of functions. This however is not possible for VM_IO vmas; in that case, KVM instead uses follow_pfn. In doing this however KVM loses the information on whether the PFN is writable. That is usually not a problem because the main use of VM_IO vmas with KVM is for BARs in PCI device assignment, however it is a bug. To fix it, use follow_pte and check pte_write while under the protection of the PTE lock. The information can be used to fail hva_to_pfn_remapped or passed back to the caller via *writable.

Usage of follow_pfn was introduced in commit add6a0cd ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05); however, even older versions have the same issue, all the way back to commit 2e2e3738 ("KVM: Handle vma regions with no backing page", 2008-07-20), as they also did not check whether the PFN was writable.

Fixes: 2e2e3738 ("KVM: Handle vma regions with no backing page")
Reported-by: David Stevens <stevensd@google.com>
Cc: 3pvd@google.com
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
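A sketch of the fixed lookup in hva_to_pfn_remapped() (condensed; the fault-and-retry path is elided):

    pte_t *ptep;
    spinlock_t *ptl;
    int r;

    r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
    if (r)
            return r;

    if (writable)
            *writable = pte_write(*ptep);   /* checked under the PTE lock */
    pfn = pte_pfn(*ptep);
    pte_unmap_unlock(ptep, ptl);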
-
- Jan 21, 2021
Marc Zyngier authored
The use of a tagged address could be pretty confusing for the whole memslot infrastructure as well as the MMU notifiers. Forbid it altogether, as it never quite worked in the first place.

Cc: stable@vger.kernel.org
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
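A sketch of the added validity check in kvm_set_memory_region() (condensed from the patch):

    /* reject userspace addresses that carry tag bits */
    if (mem->userspace_addr != untagged_addr(mem->userspace_addr))
            return -EINVAL;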
-
- Jan 07, 2021
Lai Jiangshan authored
In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as:

    need_tlb_flush |= kvm->tlbs_dirty;

with need_tlb_flush's type being int and tlbs_dirty's type being long. It means that tlbs_dirty is always used as an int and the higher 32 bits are discarded. We need to check tlbs_dirty in a correct way, and this change checks it directly without propagating it to need_tlb_flush.

Note: it's _extremely_ unlikely this neglecting of the higher 32 bits can cause problems in practice. It would require encountering tlbs_dirty on a 4 billion count boundary, and KVM would need to be using shadow paging or be running a nested guest.

Cc: stable@vger.kernel.org
Fixes: a4ee1ca4 ("KVM: MMU: delay flush all tlbs on sync_page path")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20201217154118.16497-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
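A sketch of the fix (the surrounding unmap call is abbreviated):

    need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end,
                                         range->flags);
    /* check the long-typed tlbs_dirty directly instead of ORing it
     * into an int and truncating the upper 32 bits */
    if (need_tlb_flush || kvm->tlbs_dirty)
            kvm_flush_remote_tlbs(kvm);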
-