  1. Aug 13, 2021
    • KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      Sean Christopherson authored
      
      Add yet another spinlock for the TDP MMU and take it when marking indirect
      shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
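
      A condensed sketch of the resulting flow, assuming the new field is a
      spinlock_t named mmu_unsync_pages_lock in struct kvm_arch (names and
      surrounding context are illustrative, not the exact diff):

        for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
                ...
                /*
                 * TDP MMU page faults run with mmu_lock held for read, so
                 * the unsync logic needs its own serialization.  Take the
                 * lock unconditionally; if mmu_lock is held for write there
                 * can be no contention on the inner spinlock anyway.
                 */
                spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
                kvm_unsync_page(vcpu, sp);
                spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
        }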
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs · 0103098f
      Sean Christopherson authored
      
      Set the min_level for the TDP iterator at the root level when zapping all
      SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
      SPTE will recursively zap all its children, thus there is no need for the
      iterator to attempt to step down.  This avoids rereading the top-level
      SPTEs after they are zapped by causing try_step_down() to short-circuit.
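
      A sketch of the idea, using the TDP MMU iterator as it existed at the
      time (context condensed; verify details against the tree):

        /* A root-level min_level makes try_step_down() a no-op. */
        int min_level = zap_all ? root->role.level : PG_LEVEL_4K;

        for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
                                   min_level, start, end) {
                /* Zapping a non-leaf SPTE zaps all of its children. */
                ...
        }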
      
      In most cases, optimizing try_step_down() will be in the noise as the cost
      of zapping SPTEs completely dominates the overall time.  The optimization
      is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
      is zapping in response to multiple memslot updates when userspace is adding
      and removing read-only memslots for option ROMs.  In that case, the task
      doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
      for read and thus can be a noisy neighbor of sorts.
      
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs · 524a1e4e
      Sean Christopherson authored
      
      Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
      really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
      SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
      not aligned to the range size of top-level SPTEs, e.g. 512GB with 4-level
      paging, the "zap all" flows will skip top-level SPTEs whose range extends
      beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
      
      Use the current upper bound (based on host.MAXPHYADDR) to detect that the
      caller wants to zap all SPTEs, e.g. instead of using the max theoretical
      gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
      to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
      
      Add a WARN on kvm->arch.tdp_mmu_pages when the TDP MMU is destroyed to
      help future debuggers should KVM decide to leak SPTEs again.
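
      A sketch of the bound and the "zap all" detection (helper name per the
      patch; treat the details as illustrative):

        static inline gfn_t tdp_mmu_max_gfn_host(void)
        {
                /*
                 * Bound TDP MMU walks at host.MAXPHYADDR; shadow_phys_bits
                 * tracks the host's effective physical address width.
                 */
                return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
        }

        /* In zap_gfn_range(): an end of "all ones" means zap everything. */
        bool zap_all = (start == 0 && end >= tdp_mmu_max_gfn_host());
        end = min(end, tdp_mmu_max_gfn_host());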
      
      The bug is most easily reproduced by running (and unloading!) KVM in a
      VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
      
        =============================================================================
        BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
        CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x45/0x59
         slab_err+0x95/0xc9
         __kmem_cache_shutdown.cold+0x3c/0x158
         kmem_cache_destroy+0x3d/0xf0
         kvm_mmu_module_exit+0xa/0x30 [kvm]
         kvm_arch_exit+0x5d/0x90 [kvm]
         kvm_exit+0x78/0x90 [kvm]
         vmx_exit+0x1a/0x50 [kvm_intel]
         __x64_sys_delete_module+0x13f/0x220
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF · 18712c13
      Sean Christopherson authored
      
      Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
      in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
      to account for the case where #PF is intercepted to handle
      guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
      L1.  At best, L1 will complain and inject the #PF back into L2.  At
      worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
      page faults.
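
      The shape of the fix in the nested exit-reflection logic, hedged (the
      host_apf_flags check is the pre-existing async-#PF condition; the
      vmx_need_pf_intercept() call is what is added):

        else if (is_page_fault(intr_info))
                /*
                 * L0 wants the #PF if it is intercepting #PF for any
                 * reason, async page faults *or* the guest.MAXPHYADDR <
                 * host.MAXPHYADDR workaround, not just the former.
                 */
                return vcpu->arch.apf.host_apf_flags ||
                       vmx_need_pf_intercept(vcpu);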
      
      Note, while the bug was technically introduced by the commit that added
      support for the MAXPHYADDR madness, the shame is all on commit
      a0c13434 ("KVM: VMX: introduce vmx_need_pf_intercept").
      
      Fixes: 1dbf5d68 ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
      Cc: stable@vger.kernel.org
      Cc: Peter Shier <pshier@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812045615.3167686-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault · 85aa8889
      Junaid Shahid authored
      
      When a nested EPT violation/misconfig is injected into the guest,
      the shadow EPT PTEs associated with that address need to be synced.
      This is done by kvm_inject_emulated_page_fault() before it calls
      nested_ept_inject_page_fault(). However, that will only sync the
      shadow EPT PTE associated with the current L1 EPTP. Since the ASID
      is based on EP4TA rather than the full EPTP, syncing the current
      EPTP is not enough. The SPTEs associated with any other L1 EPTPs
      in the prev_roots cache with the same EP4TA also need to be synced.
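
      A condensed sketch of the matching walk over the cached roots (helper
      details are illustrative; EPTP_PA_MASK covers the EP4TA address bits):

        for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
                /* Sync any cached root whose EPTP shares the EP4TA. */
                if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
                    (mmu->prev_roots[i].pgd & EPTP_PA_MASK) ==
                    (eptp & EPTP_PA_MASK))
                        vcpu->arch.mmu->invlpg(vcpu, addr,
                                               mmu->prev_roots[i].hpa);
        }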
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Message-Id: <20210806222229.1645356-1-junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove dead initialization · ffbe17ca
      Paolo Bonzini authored
      
      hv_vcpu is initialized again a dozen lines below, and at this
      point vcpu->arch.hyperv is not valid.  Remove the initializer.
      
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels · 1383279c
      Sean Christopherson authored
      
      Remove an ancient restriction that disallowed exposing EFER.NX to the
      guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
      The motivation of the check, added by commit 2cc51560 ("KVM: VMX:
      Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
      out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
      the guest with the host's EFER.NX and thus avoid context switching EFER
      if the only divergence was the NX bit.
      
      Fast forward to today, and KVM has long since stopped running the guest
      with the host's EFER.NX.  Not only does KVM context switch EFER if
      host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
      guest.EFER.NX=1 when using shadow paging (to emulate SMEP).  Furthermore,
      the entire motivation for the restriction was made obsolete over a decade
      ago when Intel added dedicated host and guest EFER fields in the VMCS
      (Nehalem timeframe), which reduced the overhead of context switching EFER
      from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.
      
      In practice, the removed restriction only affects non-PAE 32-bit kernels,
      as EFER.NX is set during boot if NX is supported and the kernel will use
      PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
      will actually use NX itself (mark PTEs non-executable).
      
      Alternatively and/or complementarily, startup_32_smp() in head_32.S could
      be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
      the scenario where NX is supported but not enabled.  However, that runs
      the risk of breaking non-KVM non-PAE kernels (though the risk is very,
      very low as there are no known EFER.NX errata), and also eliminates an
      easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
      across nested virtualization transitions.
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210805183804.1221554-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. Aug 09, 2021
    • powerpc/kprobes: Fix kprobe Oops happens in booke · 43e8f760
      Pu Lehui authored
      
      When using a kprobe on a powerpc booke series processor, an Oops happens
      as shown below:
      
      / # echo "p:myprobe do_nanosleep" > /sys/kernel/debug/tracing/kprobe_events
      / # echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
      / # sleep 1
      [   50.076730] Oops: Exception in kernel mode, sig: 5 [#1]
      [   50.077017] BE PAGE_SIZE=4K SMP NR_CPUS=24 QEMU e500
      [   50.077221] Modules linked in:
      [   50.077462] CPU: 0 PID: 77 Comm: sleep Not tainted 5.14.0-rc4-00022-g251a1524293d #21
      [   50.077887] NIP:  c0b9c4e0 LR: c00ebecc CTR: 00000000
      [   50.078067] REGS: c3883de0 TRAP: 0700   Not tainted (5.14.0-rc4-00022-g251a1524293d)
      [   50.078349] MSR:  00029000 <CE,EE,ME>  CR: 24000228  XER: 20000000
      [   50.078675]
      [   50.078675] GPR00: c00ebdf0 c3883e90 c313e300 c3883ea0 00000001 00000000 c3883ecc 00000001
      [   50.078675] GPR08: c100598c c00ea250 00000004 00000000 24000222 102490c2 bff4180c 101e60d4
      [   50.078675] GPR16: 00000000 102454ac 00000040 10240000 10241100 102410f8 10240000 00500000
      [   50.078675] GPR24: 00000002 00000000 c3883ea0 00000001 00000000 0000c350 3b9b8d50 00000000
      [   50.080151] NIP [c0b9c4e0] do_nanosleep+0x0/0x190
      [   50.080352] LR [c00ebecc] hrtimer_nanosleep+0x14c/0x1e0
      [   50.080638] Call Trace:
      [   50.080801] [c3883e90] [c00ebdf0] hrtimer_nanosleep+0x70/0x1e0 (unreliable)
      [   50.081110] [c3883f00] [c00ec004] sys_nanosleep_time32+0xa4/0x110
      [   50.081336] [c3883f40] [c001509c] ret_from_syscall+0x0/0x28
      [   50.081541] --- interrupt: c00 at 0x100a4d08
      [   50.081749] NIP:  100a4d08 LR: 101b5234 CTR: 00000003
      [   50.081931] REGS: c3883f50 TRAP: 0c00   Not tainted (5.14.0-rc4-00022-g251a1524293d)
      [   50.082183] MSR:  0002f902 <CE,EE,PR,FP,ME>  CR: 24000222  XER: 00000000
      [   50.082457]
      [   50.082457] GPR00: 000000a2 bf980040 1024b4d0 bf980084 bf980084 64000000 00555345 fefefeff
      [   50.082457] GPR08: 7f7f7f7f 101e0000 00000069 00000003 28000422 102490c2 bff4180c 101e60d4
      [   50.082457] GPR16: 00000000 102454ac 00000040 10240000 10241100 102410f8 10240000 00500000
      [   50.082457] GPR24: 00000002 bf9803f4 10240000 00000000 00000000 100039e0 00000000 102444e8
      [   50.083789] NIP [100a4d08] 0x100a4d08
      [   50.083917] LR [101b5234] 0x101b5234
      [   50.084042] --- interrupt: c00
      [   50.084238] Instruction dump:
      [   50.084483] 4bfffc40 60000000 60000000 60000000 9421fff0 39400402 914200c0 38210010
      [   50.084841] 4bfffc20 00000000 00000000 00000000 <7fe00008> 7c0802a6 7c892378 93c10048
      [   50.085487] ---[ end trace f6fffe98e2fa8f3e ]---
      [   50.085678]
      Trace/breakpoint trap
      
      There is no real mode on the booke arch; MMU translation is always on.
      The MSR_IS/MSR_DS bits on booke switch the address space and cannot be
      used to detect real mode.
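
      A sketch of the fix in the kprobe trap handler, per the reasoning above
      (the exact condition shape is illustrative):

        /*
         * Real mode does not exist on booke; MSR_IR/MSR_DR cleared there
         * means a different address space, not an untranslated access.
         */
        if (!IS_ENABLED(CONFIG_BOOKE) &&
            (!(regs->msr & MSR_IR) || !(regs->msr & MSR_DR)))
                return 0;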
      
      Fixes: 21f8b2fa ("powerpc/kprobes: Ignore traps that happened in real mode")
      Signed-off-by: Pu Lehui <pulehui@huawei.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210809023658.218915-1-pulehui@huawei.com
  3. Aug 05, 2021
    • mips: Fix non-POSIX regexp · 28bbbb98
      H. Nikolaus Schaller authored

      Cross compiling a MIPS kernel with a BSD-based HOSTCC leads
      to errors like
      
        SYNC    include/config/auto.conf.cmd - due to: .config
      egrep: empty (sub)expression
        UPD     include/config/kernel.release
        HOSTCC  scripts/dtc/dtc.o - due to target missing
      
      It turns out that egrep is invoked with this pattern:
      
      		(|MINOR_|PATCHLEVEL_)
      
      This is not valid syntax, or gives undefined results, according
      to the POSIX 9.5.3 ERE grammar:
      
      	https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
      
      It seems to be silently accepted by the Linux egrep implementation
      while a BSD host complains.
      
      Such patterns can be replaced by a transformation like
      
      	"(|p1|p2)" -> "(p1|p2)?"
      
      Fixes: 48c35b2d ("[MIPS] There is no __GNUC_MAJOR__")
      Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    • x86/tools/relocs: Fix non-POSIX regexp · fa953adf
      H. Nikolaus Schaller authored

      Running the x86 relocs tool, built with a BSD-based HOSTCC,
      leads to errors like
      
        VOFFSET arch/x86/boot/compressed/../voffset.h - due to: vmlinux
        CC      arch/x86/boot/compressed/misc.o - due to: arch/x86/boot/compressed/../voffset.h
        OBJCOPY arch/x86/boot/compressed/vmlinux.bin - due to: vmlinux
        RELOCS  arch/x86/boot/compressed/vmlinux.relocs - due to: vmlinux
      empty (sub)expressionarch/x86/boot/compressed/Makefile:118: recipe for target 'arch/x86/boot/compressed/vmlinux.relocs' failed
      make[3]: *** [arch/x86/boot/compressed/vmlinux.relocs] Error 1
      
      It turns out that relocs.c uses patterns like
      
      	"something(|_end)"
      
      This is not valid syntax, or gives undefined results, according
      to the POSIX 9.5.3 ERE grammar:
      
      	https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
      
      It seems to be silently accepted by the Linux regcomp() implementation
      while a BSD host complains.
      
      Such patterns can be replaced by a transformation like
      
      	"(|p1|p2)" -> "(p1|p2)?"
      
      Fixes: fd952815 ("x86-32, relocs: Whitelist more symbols for ld bug workaround")
      Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    • MIPS: check return value of pgtable_pmd_page_ctor · 6aa32467
      Huang Pei authored
      
      +. According to Documentation/vm/split_page_table_lock, handle failure
      of pgtable_pmd_page_ctor
      
      +. Use GFP_KERNEL_ACCOUNT instead of GFP_KERNEL|__GFP_ACCOUNT
      
      +. Adjust coding style
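
      A sketch of the resulting allocator, closely following the points above
      (PMD_ORDER and the pmd_init() details are MIPS-specific and shown for
      illustration):

        pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
        {
                pmd_t *pmd;
                struct page *pg;

                pg = alloc_pages(GFP_KERNEL_ACCOUNT, PMD_ORDER);
                if (!pg)
                        return NULL;

                /* Initializes the split PMD lock; can fail. */
                if (!pgtable_pmd_page_ctor(pg)) {
                        __free_pages(pg, PMD_ORDER);
                        return NULL;
                }

                pmd = (pmd_t *)page_address(pg);
                pmd_init((unsigned long)pmd, (unsigned long)invalid_pte_table);
                return pmd;
        }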
      
      Fixes: ed914d48 ("MIPS: add PMD table accounting into MIPS")
      Reported-by: Joshua Kinard <kumba@gentoo.org>
      Signed-off-by: Huang Pei <huangpei@loongson.cn>
      Reviewed-by: Joshua Kinard <kumba@gentoo.org>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    • KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds · d5aaad6f
      Sean Christopherson authored
      
      Take a signed 'long' instead of an 'unsigned long' for the number of
      pages to add/subtract to the total number of pages used by the MMU.  This
      fixes a zero-extension bug on 32-bit kernels that effectively corrupts
      the per-cpu counter used by the shrinker.
      
      Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
      kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
      an unsigned 32-bit value on 32-bit kernels.  As a result, the value used
      to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
      sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
      4294967295 and effectively corrupts the counter.
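
      The conversion bug is easy to demonstrate in isolation (a stand-alone
      program, not KVM code; on a 32-bit kernel 'unsigned long' is 32 bits):

        #include <inttypes.h>
        #include <stdio.h>

        int main(void)
        {
                uint32_t unsigned_nr = (uint32_t)-1; /* 32-bit unsigned long */
                int32_t signed_nr = -1;              /* 32-bit signed long */

                /* Widening to the counter's s64: zero- vs sign-extension. */
                printf("%" PRId64 "\n", (int64_t)unsigned_nr); /* 4294967295 */
                printf("%" PRId64 "\n", (int64_t)signed_nr);   /* -1 */
                return 0;
        }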
      
      This was found by a staggering amount of sheer dumb luck when running
      kvm-unit-tests on a 32-bit KVM build.  The shrinker just happened to kick
      in while running tests and do_shrink_slab() logged an error about trying
      to free a negative number of objects.  The truly lucky part is that the
      kernel just happened to be a slightly stale build, as the shrinker no
      longer yells about negative objects as of commit 18bb473e ("mm:
      vmscan: shrink deferred objects proportional to priority").
      
       vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460
      
      Fixes: bc8a3d89 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210804214609.1096003-1-seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. Aug 04, 2021
    • KVM: SVM: improve the code readability for ASID management · bb2baeb2
      Mingwei Zhang authored
      
      KVM SEV code uses bitmaps to manage ASID states. ASID 0 is always skipped
      because it is never used by a VM. Thus, in the existing code, an ASID
      value and its bitmap position always have an 'offset-by-1' relationship.
      
      SEV and SEV-ES share the ASID space, so KVM uses a dynamic range
      [min_asid, max_asid] to handle SEV and SEV-ES ASIDs separately.
      
      Existing code mixes the usage of ASID value and its bitmap position by
      using the same variable called 'min_asid'.
      
      Fix the min_asid usage: ensure that its usage is consistent with its name;
      allocate an extra slot for ASID 0 so that each ASID's value matches its
      bitmap position. Add comments on the ASID bitmap allocation to clarify
      the size change.
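
      A sketch of the size change (nr_asids is illustrative; min_asid and
      max_asid are the raw ASID bounds from the description above):

        /* Bit 0 (ASID 0) is never set; index == raw ASID everywhere. */
        nr_asids = max_sev_asid + 1;
        sev_asid_bitmap = bitmap_zalloc(nr_asids, GFP_KERNEL);

        /* Allocation then searches raw ASID values directly. */
        asid = find_next_zero_bit(sev_asid_bitmap, max_asid + 1, min_asid);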
      
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alper Gun <alpergun@google.com>
      Cc: Dionna Glaze <dionnaglaze@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vipin Sharma <vipinsh@google.com>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Message-Id: <20210802180903.159381-1-mizhang@google.com>
      [Fix up sev_asid_free to also index by ASID, as suggested by Sean
       Christopherson, and use nr_asids in sev_cpu_init. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • perf/x86/amd: Don't touch the AMD64_EVENTSEL_HOSTONLY bit inside the guest · df51fe7e
      Like Xu authored
      
      If we use "perf record" in an AMD Milan guest, dmesg reports a #GP
      warning from an unchecked MSR access error on MSR_F15H_PERF_CTLx:
      
        [] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000110076) at rIP: 0xffffffff8106ddb4 (native_write_msr+0x4/0x20)
        [] Call Trace:
        []  amd_pmu_disable_event+0x22/0x90
        []  x86_pmu_stop+0x4c/0xa0
        []  x86_pmu_del+0x3a/0x140
      
      The AMD64_EVENTSEL_HOSTONLY bit is defined for use on the host; the
      guest perf driver must avoid setting it.
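
      The shape of the fix, hedged (perf_ctr_virt_mask already exists to carry
      AMD64_EVENTSEL_HOSTONLY; the change is applying it on the MSR write path
      so the bit is never written where it must not be):

        u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask);

        /* Never write bits (e.g. HOSTONLY) masked off for this context. */
        wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);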
      
      Fixes: 1018faa6 ("perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabled")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Tested-by: Kim Phillips <kim.phillips@amd.com>
      Tested-by: Liam Merwick <liam.merwick@oracle.com>
      Link: https://lkml.kernel.org/r/20210802070850.35295-1-likexu@tencent.com
    • perf/x86: Fix out of bound MSR access · f4b4b456
      Peter Zijlstra authored
      
      On Wed, Jul 28, 2021 at 12:49:43PM -0400, Vince Weaver wrote:
      > [32694.087403] unchecked MSR access error: WRMSR to 0x318 (tried to write 0x0000000000000000) at rIP: 0xffffffff8106f854 (native_write_msr+0x4/0x20)
      > [32694.101374] Call Trace:
      > [32694.103974]  perf_clear_dirty_counters+0x86/0x100
      
      The problem is that perf_clear_dirty_counters() doesn't filter out all
      fake counters; in particular, the above (erroneously) tries to use
      FIXED_BTS. Limit the fixed counter indexes to the hardware-supplied number.
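
      A condensed sketch of the fixed loop in perf_clear_dirty_counters()
      (constant and helper names as in the x86 perf code; details hedged):

        for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
                if (i >= INTEL_PMC_IDX_FIXED) {
                        /* Metrics and fake events have no HW counter. */
                        if ((i - INTEL_PMC_IDX_FIXED) >=
                            hybrid(cpuc->pmu, num_counters_fixed))
                                continue;

                        wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 +
                               (i - INTEL_PMC_IDX_FIXED), 0);
                } else {
                        wrmsrl(x86_pmu_event_addr(i), 0);
                }
        }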
      
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Like Xu <likexu@tencent.com>
      Link: https://lkml.kernel.org/r/YQJxka3dxgdIdebG@hirez.programming.kicks-ass.net
    • sock: allow reading and changing sk_userlocks with setsockopt · 04190bf8
      Pavel Tikhomirov authored
      
      The SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable the automatic
      socket buffer adjustment done by the kernel (see tcp_fixup_rcvbuf() and
      tcp_sndbuf_expand()). On a newly created socket this adjustment is
      enabled, but changing the socket buffer size via
      setsockopt(SO_{SND,RCV}BUF*) disables it.
      
      CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on
      restore: first to increase buffer sizes for restoring packet queues,
      and second to restore the original buffer sizes. So after a CRIU
      restore, all sockets become non-auto-adjustable, which can significantly
      decrease the network performance of restored applications.
      
      CRIU needs to be able to restore sockets to the same enabled/disabled
      adjustment state they were in before the dump, so let's add a special
      setsockopt for it.
      
      Let's also export the SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to
      uAPI so that, using this interface, one can re-enable automatic socket
      buffer adjustment on their sockets.
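
      From userspace, restoring the lock flags could then look like this
      (assuming the new option is exposed as SO_BUF_LOCK taking a flags mask;
      check the uapi headers for the final name and value):

        #include <sys/socket.h>

        int restore_buf_locks(int fd, int locks)
        {
                /*
                 * locks: SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK as saved at dump
                 * time, or 0 to re-enable automatic adjustment of both buffers.
                 */
                return setsockopt(fd, SOL_SOCKET, SO_BUF_LOCK,
                                  &locks, sizeof(locks));
        }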
      
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB · 179c6c27
      Sean Christopherson authored
      
      Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
      freeing an SEV ASID.  The consumer, pre_sev_run(), indexes the array by
      the raw ASID, thus KVM could get a false negative when checking for a
      different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
      a new VM.
      
      Note, this cannot cause a functional issue _in the current code_, as
      pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
      last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
      guaranteed to mismatch on the first VMRUN.  However, prior to commit
      8a14fe4f ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
      last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
      last_cpu variable.  Thus it's theoretically possible that older versions
      of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
      and VMCB exactly match those of a prior VM.
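
      A sketch of the corrected nullification, matching pre_sev_run()'s
      raw-ASID indexing (per-cpu structure details condensed):

        for_each_possible_cpu(cpu) {
                struct svm_cpu_data *sd = per_cpu(svm_data, cpu);

                /* Was sev_vmcbs[asid - 1], a different (wrong) slot. */
                sd->sev_vmcbs[asid] = NULL;
        }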
      
      Fixes: 70cd94e6 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • riscv: Disable STACKPROTECTOR_PER_TASK if GCC_PLUGIN_RANDSTRUCT is enabled · a18b14d8
      Guenter Roeck authored
      
      riscv uses the value of TSK_STACK_CANARY to set
      stack-protector-guard-offset. With GCC_PLUGIN_RANDSTRUCT enabled, that
      value is non-deterministic, and with riscv:allmodconfig this often results
      in build errors such as
      
      cc1: error: '8120' is not a valid offset in '-mstack-protector-guard-offset='
      
      Enable STACKPROTECTOR_PER_TASK only if GCC_PLUGIN_RANDSTRUCT is disabled
      to fix the problem.
      
      Fixes: fea2fed2 ("riscv: Enable per-task stack canaries")
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv: dts: fix memory size for the SiFive HiFive Unmatched · d0956043
      Qiu Wenbo authored
      
      The production version of the HiFive Unmatched has 16GB of memory.
      
      Signed-off-by: Qiu Wenbo <qiuwenbo@kylinos.com.cn>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • ARC: fp: set FPU_STATUS.FWE to enable FPU_STATUS update on context switch · 3a715e80
      Vineet Gupta authored

      The FPU_STATUS register contains FP exception flag bits which are
      updated by the core as a side effect of FP instructions, but can also be
      manually wiggled, such as by the glibc C99 functions
      fe{raise,clear,test}except() etc. To effect the update, the programming
      model requires OR'ing in the FWE bit (31). This bit is write-only and
      RAZ, meaning it is effectively auto-cleared after a write and thus needs
      to be set every time, which is how glibc implements this.
      
      However, there's another use case for updating FPU_STATUS: at Linux task
      switch, when the incoming task's value needs to be programmed into the
      register. This was added as part of f45ba2bd ("ARCv2:
      fpu: preserve userspace fpu state"), which missed OR'ing in the FWE bit,
      meaning the new value was effectively not being written at all.
      This patch remedies that.
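
      The restore path then needs the write-enable bit OR'ed in, roughly as
      follows (register and field names follow arch/arc conventions, shown as
      a sketch):

        /* FWE (bit 31) is write-only/RAZ; without it the write is dropped. */
        write_aux_reg(ARC_REG_FPU_STATUS,
                      next->thread.fpu.fpu_status | FPU_STATUS_FWE);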
      
      Interestingly, this snafu was not caught in interim glibc testing, as
      the race window, which relies on a specific exception bit being
      set/cleared, is really small, especially when it involves a context
      switch. Fortunately, it was caught by glibc's math/test-fenv-tls test,
      which repeatedly sets/clears exception flags in a big loop, concurrently
      in the main program and also in a thread.
      
      Fixes: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/54
      Fixes: f45ba2bd ("ARCv2: fpu: preserve userspace fpu state")
      Cc: stable@vger.kernel.org	#5.6+
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>