1. 05 Aug, 2018 5 commits
    • Nicolai Stange's avatar
      x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() · 18b57ce2
      Nicolai Stange authored
      
      
      For VMEXITs caused by external interrupts, vmx_handle_external_intr()
      indirectly calls into the interrupt handlers through the host's IDT.
      
      It follows that these interrupts get accounted for in the
      kvm_cpu_l1tf_flush_l1d per-cpu flag.
      
      The subsequently executed vmx_l1d_flush() will thus be aware that some
      interrupts have happened and conduct a L1d flush anyway.
      
      Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed
      anymore. Drop it.
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      18b57ce2
    • Nicolai Stange's avatar
      x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d · 45b575c0
      Nicolai Stange authored
      
      
      Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
      VMENTRY.
      
      L1D flushes are costly and two modes of operations are provided to users:
      "always" and the more selective "conditional" mode.
      
      If operating in the latter, the cache would get flushed only if a host side
      code path considered unconfined had been traversed. "Unconfined" in this
      context means that it might have pulled in sensitive data like user data
      or kernel crypto keys.
      
      The need for L1D flushes is tracked by means of the per-vcpu flag
      l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
      vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
      L1d flush based on its value and reset that flag again.
      
      Currently, interrupts delivered "normally" while in root operation between
      VMEXIT and VMENTER are not taken into account. Part of the reason is that
      these don't leave any traces and thus, the vmx code is unable to tell if
      any such has happened.
      
      As proposed by Paolo Bonzini, prepare for tracking all interrupts by
      introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
      strong analogy to the per-vcpu ->l1tf_flush_l1d.
      
      A later patch will make interrupt handlers set it.
      
      For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
      per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
      
      Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
      kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
      trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate.
      
      Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
      l1tf_flush_l1d.
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      45b575c0
    • Nicolai Stange's avatar
      x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() · 5b6ccc6c
      Nicolai Stange authored
      
      
      Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes
      vmx_l1d_flush() if so.
      
      This test is unncessary for the "always flush L1D" mode.
      
      Move the check to vmx_l1d_flush()'s conditional mode code path.
      
      Notes:
      - vmx_l1d_flush() is likely to get inlined anyway and thus, there's no
        extra function call.
        
      - This inverts the (static) branch prediction, but there hadn't been any
        explicit likely()/unlikely() annotations before and so it stays as is.
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      5b6ccc6c
    • Nicolai Stange's avatar
      x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' · 427362a1
      Nicolai Stange authored
      
      
      The vmx_l1d_flush_always static key is only ever evaluated if
      vmx_l1d_should_flush is enabled. In that case however, there are only two
      L1d flushing modes possible: "always" and "conditional".
      
      The "conditional" mode's implementation tends to require more sophisticated
      logic than the "always" mode.
      
      Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
      with a 'vmx_l1d_flush_cond' one.
      
      There is no change in functionality.
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      427362a1
    • Nicolai Stange's avatar
      x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() · 379fd0c7
      Nicolai Stange authored
      
      
      vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no
      point in setting l1tf_flush_l1d to true from there again.
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      379fd0c7
  2. 19 Jul, 2018 1 commit
    • Nicolai Stange's avatar
      x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content · 288d152c
      Nicolai Stange authored
      The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
      to evict the L1d cache.
      
      However, these pages are never cleared and, in theory, their data could be
      leaked.
      
      More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
      to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
      the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.
      
      Fix this by initializing the individual vmx_l1d_flush_pages with a
      different pattern each.
      
      Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
      "flush_pages" to reflect this change.
      
      Fixes: a47dd5f0
      
       ("x86/KVM/VMX: Add L1D flush algorithm")
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      288d152c
  3. 13 Jul, 2018 8 commits
  4. 04 Jul, 2018 10 commits
  5. 14 Jun, 2018 1 commit
    • Arnd Bergmann's avatar
      KVM: x86: VMX: redo fix for link error without CONFIG_HYPERV · 1f008e11
      Arnd Bergmann authored
      Arnd had sent this patch to the KVM mailing list, but it slipped through
      the cracks of maintainers hand-off, and therefore wasn't included in
      the pull request.
      
      The same issue had been fixed by Linus in commit dbee3d02
      
       ("KVM: x86:
      VMX: fix build without hyper-v", 2018-06-12) as a self-described
      "quick-and-hacky build fix".  However, checking the compile-time
      configuration symbol with IS_ENABLED is cleaner and it is enough to
      avoid the link error, so switch to Arnd's solution.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      [Rewritten commit message. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1f008e11
  6. 13 Jun, 2018 1 commit
    • Linus Torvalds's avatar
      KVM: x86: VMX: fix build without hyper-v · dbee3d02
      Linus Torvalds authored
      Commit ceef7d10 ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap
      support") broke the build with Hyper-V disabled, because it accesses
      ms_hyperv.nested_features without checking if that exists.
      
      This is the quick-and-hacky build fix.
      
      I suspect the proper fix is to replace the
      
          static_branch_unlikely(&enable_evmcs)
      
      tests with an inline helper function that also checks that CONFIG_HYPERV
      is enabled, since without that, enable_evmcs makes no sense.
      
      But I want a working build environment first and foremost, and I'm upset
      this slipped through in the first place.  My primary build tests missed
      it because I tend to build with everything enabled, but it should have
      been caught in the kvm tree.
      
      Fixes: ceef7d10
      
       ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support")
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dbee3d02
  7. 12 Jun, 2018 2 commits
  8. 04 Jun, 2018 3 commits
    • Jim Mattson's avatar
      kvm: nVMX: Add support for "VMWRITE to any supported field" · f4160e45
      Jim Mattson authored
      
      
      Add support for "VMWRITE to any supported field in the VMCS" and
      enable this feature by default in L1's IA32_VMX_MISC MSR. If userspace
      clears the VMX capability bit, the old behavior will be restored.
      
      Note that this feature is a prerequisite for kvm in L1 to use VMCS
      shadowing, once that feature is available.
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f4160e45
    • Jim Mattson's avatar
      kvm: nVMX: Restrict VMX capability MSR changes · a943ac50
      Jim Mattson authored
      
      
      Disallow changes to the VMX capability MSRs while the vCPU is in VMX
      operation. Although this does break the existing API, it helps to
      avoid some potentially tricky situations for which there is no
      architected behavior.
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a943ac50
    • Wanpeng Li's avatar
      KVM: VMX: Optimize tscdeadline timer latency · c5ce8235
      Wanpeng Li authored
      'Commit d0659d94
      
       ("KVM: x86: add option to advance tscdeadline
      hrtimer expiration")' advances the tscdeadline (the timer is emulated
      by hrtimer) expiration in order that the latency which is incurred
      by hypervisor (apic_timer_fn -> vmentry) can be avoided. This patch
      adds the advance tscdeadline expiration support to which the tscdeadline
      timer is emulated by VMX preemption timer to reduce the hypervisor
      lantency (handle_preemption_timer -> vmentry). The guest can also
      set an expiration that is very small (for example in Linux if an
      hrtimer feeds a expiration in the past); in that case we set delta_tsc
      to 0, leading to an immediately vmexit when delta_tsc is not bigger than
      advance ns.
      
      This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on
      a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c5ce8235
  9. 01 Jun, 2018 1 commit
    • Marc Orr's avatar
      kvm: Make VM ioctl do valloc for some archs · d1e5b0e9
      Marc Orr authored
      
      
      The kvm struct has been bloating. For example, it's tens of kilo-bytes
      for x86, which turns out to be a large amount of memory to allocate
      contiguously via kzalloc. Thus, this patch does the following:
      1. Uses architecture-specific routines to allocate the kvm struct via
         vzalloc for x86.
      2. Switches arm to __KVM_HAVE_ARCH_VM_ALLOC so that it can use vzalloc
         when has_vhe() is true.
      
      Other architectures continue to default to kalloc, as they have a
      dependency on kalloc or have a small-enough struct kvm.
      Signed-off-by: default avatarMarc Orr <marcorr@google.com>
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d1e5b0e9
  10. 24 May, 2018 3 commits
  11. 23 May, 2018 3 commits
    • Jim Mattson's avatar
      KVM: nVMX: Ensure that VMCS12 field offsets do not change · 21ebf53b
      Jim Mattson authored
      
      
      Enforce the invariant that existing VMCS12 field offsets must not
      change. Experience has shown that without strict enforcement, this
      invariant will not be maintained.
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [Changed the code to use BUILD_BUG_ON_MSG instead of better, but GCC 4.6
       requiring _Static_assert. - Radim.]
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      21ebf53b
    • Jim Mattson's avatar
      KVM: nVMX: Restore the VMCS12 offsets for v4.0 fields · b348e793
      Jim Mattson authored
      
      
      Changing the VMCS12 layout will break save/restore compatibility with
      older kvm releases once the KVM_{GET,SET}_NESTED_STATE ioctls are
      accepted upstream. Google has already been using these ioctls for some
      time, and we implore the community not to disturb the existing layout.
      
      Move the four most recently added fields to preserve the offsets of
      the previously defined fields and reserve locations for the vmread and
      vmwrite bitmaps, which will be used in the virtualization of VMCS
      shadowing (to improve the performance of double-nesting).
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [Kept the SDM order in vmcs_field_to_offset_table. - Radim]
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      b348e793
    • Jim Mattson's avatar
      kvm: nVMX: Use nested_run_pending rather than from_vmentry · 6514dc38
      Jim Mattson authored
      When saving a vCPU's nested state, the vmcs02 is discarded. Only the
      shadow vmcs12 is saved. The shadow vmcs12 contains all of the
      information needed to reconstruct an equivalent vmcs02 on restore, but
      we have to be able to deal with two contexts:
      
      1. The nested state was saved immediately after an emulated VM-entry,
         before the vmcs02 was ever launched.
      
      2. The nested state was saved some time after the first successful
         launch of the vmcs02.
      
      Though it's an implementation detail rather than an architected bit,
      vmx->nested_run_pending serves to distinguish between these two
      cases. Hence, we save it as part of the vCPU's nested state. (Yes,
      this is ugly.)
      
      Even when restoring from a checkpoint, it may be necessary to build
      the vmcs02 as if prepare_vmcs02 was called from nested_vmx_run. So,
      the 'from_vmentry' argument should be dropped, and
      vmx->nested_run_pending should be consulted instead. The nested state
      restoration code then has to set vmx->nested_run_pending prior to
      calling prepare_vmcs02. It's important that the restoration code set
      vmx->nested_run_pending anyway, since the flag impacts things like
      interrupt delivery as well.
      
      Fixes: cf8b84f4
      
       ("kvm: nVMX: Prepare for checkpointing L2 state")
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      6514dc38
  12. 17 May, 2018 2 commits