  2. Dec 21, 2021
    • parisc: Fix mask used to select futex spinlock · d3a5a68c
      John David Anglin authored
      
      The address bits used to select the futex spinlock need to match those used in
      the LWS code in syscall.S. The mask 0x3f8 only selects 7 bits.  It should
      select 8 bits.
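      A quick sanity check of the bit-count claim (plain Python, not kernel code; the corrected mask value 0x7f8 below is an assumption for illustration, the actual value is in the patch):

```python
# A mask selects as many address bits as it has bits set.
old_mask = 0x3f8
assert bin(old_mask).count("1") == 7  # only 7 bits are selected

new_mask = 0x7f8  # hypothetical 8-bit mask, for illustration only
assert bin(new_mask).count("1") == 8
```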
      
      This change fixes the glibc nptl/tst-cond24 and nptl/tst-cond25 tests.
      
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Fixes: 53a42b63 ("parisc: Switch to more fine grained lws locks")
      Cc: stable@vger.kernel.org # 5.10+
      Signed-off-by: Helge Deller <deller@gmx.de>
    • parisc: Correct completer in lws start · 8f66fce0
      John David Anglin authored
      
      The completer in the "or,ev %r1,%r30,%r30" instruction is reversed, so we are
      not clipping the LWS number when we are called from a 32-bit process (W=0).
      We need to nullify the following depdi instruction when the least-significant
      bit of %r30 is 1.
      
      If the %r20 register is not clipped, a user process could perform a LWS call
      that would branch to an undefined location in the kernel and potentially crash
      the machine.
      
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: Helge Deller <deller@gmx.de>
    • KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU · fdba608f
      Sean Christopherson authored
      
      Drop a check that guards triggering a posted interrupt on the currently
      running vCPU, and more importantly guards waking the target vCPU if
      triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
      If a vIRQ is delivered from asynchronous context, the target vCPU can be
      the currently running vCPU and can also be blocking, in which case
      skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
      be a wake event for the vCPU.
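      A minimal toy model (illustrative Python, not KVM code; names are invented) of why the guard drops a wake event:

```python
class Vcpu:
    """Toy vCPU with just the mode the delivery path cares about."""
    def __init__(self, mode):
        self.mode = mode

def deliver_posted_interrupt(vcpu, running_vcpu, guard_running):
    """Return the list of vCPUs that receive a wake event."""
    if guard_running and vcpu is running_vcpu:
        return []                 # old behavior: assume nothing to do
    if vcpu.mode != "IN_GUEST_MODE":
        return [vcpu]             # posted IRQ can't be injected: wake instead
    return []

# vIRQ delivered from IRQ context: the target is the current vCPU, but it
# is blocking, so skipping the wake loses the event.
blocking = Vcpu("BLOCKING")
assert deliver_posted_interrupt(blocking, blocking, guard_running=True) == []
assert deliver_posted_interrupt(blocking, blocking, guard_running=False) == [blocking]
```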
      
      The "do nothing" logic when "vcpu == running_vcpu" mostly works only
      because the majority of calls to ->deliver_posted_interrupt(), especially
      when using posted interrupts, come from synchronous KVM context.  But if
      a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
      and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
      use posted interrupts, wake events from the device will be delivered to
      KVM from IRQ context, e.g.
      
        vfio_msihandler()
        |
        |-> eventfd_signal()
            |
            |-> ...
                |
                |->  irqfd_wakeup()
                     |
                     |->kvm_arch_set_irq_inatomic()
                        |
                        |-> kvm_irq_delivery_to_apic_fast()
                            |
                            |-> kvm_apic_set_irq()
      
      This also aligns the non-nested and nested usage of triggering posted
      interrupts, and will allow for additional cleanups.
      
      Fixes: 379a3c8e ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
      Cc: stable@vger.kernel.org
      Reported-by: Longpeng (Mike) <longpeng2@huawei.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. Dec 20, 2021
    • parisc: Clear stale IIR value on instruction access rights trap · 484730e5
      Helge Deller authored
      
      When a trap 7 (Instruction access rights) occurs, this means the CPU
      couldn't execute an instruction due to missing execute permissions on
      the memory region.  In this case it seems the CPU didn't even fetch
      the instruction from memory and thus did not store it in the cr19 (IIR)
      register before calling the trap handler. So, the trap handler will find
      some random old stale value in cr19.
      
      This patch simply overwrites the stale IIR value with a constant magic
      "bad food" value (0xbaadf00d), in the hope people don't start to try to
      understand the various random IIR values in trap 7 dumps.
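      A sketch of what the trap-handler change amounts to (plain Python; the register-struct field name is hypothetical):

```python
BAD_FOOD = 0xBAADF00D  # the magic constant named in the commit

def handle_trap7(regs):
    # Hardware never wrote IIR for this trap, so overwrite the stale value
    # with a recognizable constant before any register dump.
    regs["iir"] = BAD_FOOD
    return regs

regs = {"iir": 0x12345678}  # whatever stale value was left in cr19
assert handle_trap7(regs)["iir"] == 0xBAADF00D
```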
      
      Noticed-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
    • KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required · cd0e615c
      Sean Christopherson authored
      
      Synthesize a triple fault if L2 guest state is invalid at the time of
      VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
      guest state via ioctls(), e.g. KVM_SET_SREGS.  KVM should never emulate
      invalid guest state, since from L1's perspective, it's architecturally
      impossible for L2 to have invalid state while L2 is running in hardware.
      E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
      or #GP.
      
      Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
      can trigger this scenario, as nested VM-Enter correctly rejects any
      attempt to enter L2 with invalid state.
      
      RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
      behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
      via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
      Following AMD's SMRAM layout is important as AMD's layout saves/restores
      the descriptor cache information, including CS.RPL and SS.RPL, and also
      defines all the fields relevant to invalid guest state as read-only, i.e.
      so long as the vCPU had valid state before the SMI, which is guaranteed
      for L2, RSM will generate valid state unless SMRAM was modified.  Intel's
      layout saves/restores only the selector, which means that scenarios where
      the selector and cached RPL don't match, e.g. conforming code segments,
      would yield invalid guest state.  Intel CPUs fudge around this issue by
      stuffing SS.RPL and CS.RPL on RSM.  Per Intel's SDM on the "Default
      Treatment of RSM", paraphrasing for brevity:
      
        IF internal storage indicates that the [CPU was post-VMXON]
        THEN
           enter VMX operation (root or non-root);
           restore VMX-critical state as defined in Section 34.14.1;
           set to their fixed values any bits in CR0 and CR4 whose values must
           be fixed in VMX operation [unless coming from an unrestricted guest];
           IF RFLAGS.VM = 0 AND (in VMX root operation OR the
              “unrestricted guest” VM-execution control is 0)
           THEN
             CS.RPL := SS.DPL;
             SS.RPL := SS.DPL;
           FI;
           restore current VMCS pointer;
        FI;
      
      Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
      will synthesize TRIPLE_FAULT in this scenario.  KVM's behavior is allowed
      as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
      only way for CR0 and/or CR4 to have illegal values is if they were
      modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
      section states "modifying these registers will result in unpredictable
      behavior".
      
      KVM's ioctl() behavior is less straightforward.  Because KVM allows
      ioctls() to be executed in any order, rejecting an ioctl() if it would
      result in invalid L2 guest state is not an option as KVM cannot know if
      a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
      drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE.  Ideally, KVM would
      reject KVM_RUN if L2 contained invalid guest state, but that carries the
      risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
      exited to userspace.  Setting a flag/request to detect such a scenario is
      undesirable because (a) it's extremely unlikely to add value to KVM as a
      whole, and (b) KVM would need to consider ioctl() interactions with such
      a flag, e.g. if userspace migrated the vCPU while the flag were set.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-3-seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Always clear vmx->fail on emulation_required · a80dfc02
      Sean Christopherson authored
      
      Revert a relatively recent change that set vmx->fail if the vCPU is in L2
      and emulation_required is true, as that behavior is completely bogus.
      Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:
      
        (a) it's impossible to have both a VM-Fail and VM-Exit
        (b) vmcs.EXIT_REASON is not modified on VM-Fail
        (c) emulation_required refers to guest state and guest state checks are
            always VM-Exits, not VM-Fails.
      
      For KVM specifically, emulation_required is handled before nested exits
      in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
      i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
      Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
      firing when tearing down the VM, as KVM never expects vmx->fail to be set
      when L2 is active; KVM always reflects those errors into L1.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
                                      nested_vmx_vmexit+0x16bd/0x17e0
                                      arch/x86/kvm/vmx/nested.c:4547
        Modules linked in:
        CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
        Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
        Call Trace:
         vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
         nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
         vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
         kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
         kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
         kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
         kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
         kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
         kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
         __fput+0x3fc/0x870 fs/file_table.c:280
         task_work_run+0x146/0x1c0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0x705/0x24f0 kernel/exit.c:832
         do_group_exit+0x168/0x2d0 kernel/exit.c:929
         get_signal+0x1740/0x2120 kernel/signal.c:2852
         arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
         do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c8607e4a ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
      Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Always set kvm_run->if_flag · c5063551
      Marc Orr authored
      
      The kvm_run struct's if_flag is a part of the userspace/kernel API. The
      SEV-ES patches failed to set this flag because it's no longer needed by
      QEMU (according to the comment in the source code). However, other
      hypervisors may make use of this flag. Therefore, set the flag for
      guests with encrypted registers (i.e., with guest_state_protected set).
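      The resulting policy can be sketched like this (toy Python, not the actual KVM implementation; the helper name is made up):

```python
X86_EFLAGS_IF = 1 << 9

def get_if_flag(rflags, guest_state_protected):
    # With encrypted registers the real RFLAGS is unreadable, so report
    # the interrupt flag as set rather than leaving if_flag stale.
    if guest_state_protected:
        return 1
    return 1 if rflags & X86_EFLAGS_IF else 0

assert get_if_flag(0, guest_state_protected=True) == 1
assert get_if_flag(0, guest_state_protected=False) == 0
assert get_if_flag(X86_EFLAGS_IF, guest_state_protected=False) == 1
```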
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Signed-off-by: Marc Orr <marcorr@google.com>
      Message-Id: <20211209155257.128747-1-marcorr@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
    • KVM: x86/mmu: Don't advance iterator after restart due to yielding · 3a0f64de
      Sean Christopherson authored
      
      After dropping mmu_lock in the TDP MMU, restart the iterator during
      tdp_iter_next() and do not advance the iterator.  Advancing the iterator
      results in skipping the top-level SPTE and all its children, which is
      fatal if any of the skipped SPTEs were not visited before yielding.
      
      When zapping all SPTEs, i.e. when min_level == root_level, restarting the
      iter and then invoking tdp_iter_next() is always fatal if the current gfn
      has a valid SPTE, as advancing the iterator results in try_step_side()
      skipping the current gfn, which wasn't visited before yielding.
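      A toy model of the fixed iterator behavior (illustrative Python, not the tdp_iter code; the iterator is assumed to start positioned on its first entry):

```python
class TdpIter:
    def __init__(self, gfns):
        self.gfns, self.i, self.yielded = gfns, 0, False

    def cond_resched(self):
        self.yielded = True  # lock was dropped; iterator state is stale

    def next(self):
        if self.yielded:
            self.yielded = False
            return self.gfns[self.i]  # restart: re-visit, don't advance
        self.i += 1
        return self.gfns[self.i] if self.i < len(self.gfns) else None

it = TdpIter(["gfn0", "gfn1"])
it.cond_resched()           # yield while positioned at gfn0
assert it.next() == "gfn0"  # gfn0 is re-visited instead of being skipped
assert it.next() == "gfn1"
```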
      
      Sprinkle WARNs on iter->yielded being true in various helpers that are
      often used in conjunction with yielding, and tag the helper with
      __must_check to reduce the probability of improper usage.
      
      Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
      SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
      the shadow page will be leaked and KVM will WARN accordingly.
      
        WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2a0 [kvm]
         kvm_vcpu_release+0x34/0x60 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5c/0x90
         do_exit+0x364/0xa10
         ? futex_unqueue+0x38/0x60
         do_group_exit+0x33/0xa0
         get_signal+0x155/0x850
         arch_do_signal_or_restart+0xed/0x750
         exit_to_user_mode_prepare+0xc5/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x48/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
      kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
      marking a struct page as dirty/accessed after it has been put back on the
      free list.  This directly triggers a WARN due to encountering a page with
      page_count() == 0, but it can also lead to data corruption and additional
      errors in the kernel.
      
        WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
        RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
        Call Trace:
         <TASK>
         kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
         __handle_changed_spte+0x92e/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         zap_gfn_range+0x549/0x620 [kvm]
         kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
         mmu_free_root_page+0x219/0x2c0 [kvm]
         kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
         kvm_mmu_unload+0x1c/0xa0 [kvm]
         kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
         kvm_put_kvm+0x3b1/0x8b0 [kvm]
         kvm_vcpu_release+0x4e/0x70 [kvm]
         __fput+0x1f7/0x8c0
         task_work_run+0xf8/0x1a0
         do_exit+0x97b/0x2230
         do_group_exit+0xda/0x2a0
         get_signal+0x3be/0x1e50
         arch_do_signal_or_restart+0x244/0x17f0
         exit_to_user_mode_prepare+0xcb/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x4d/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Note, the underlying bug existed even before commit 1af4a960 ("KVM:
      x86/mmu: Yield in TDP MMU iter even if no SPTEs changed") moved calls to
      tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
      incorrectly advance past a top-level entry when yielding on a lower-level
      entry.  But with respect to leaking shadow pages, the bug was introduced
      by yielding before processing the current gfn.
      
      Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
      callers could jump to their "retry" label.  The downside of that approach
      is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
      in the loop, and there's no easy way to enforce that requirement.
      
      Ideally, KVM would handle the cond_resched() fully within the iterator
      macro (the code is actually quite clean) and avoid this entire class of
      bugs, but that is extremely difficult to do while also supporting yielding
      after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
      SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
      bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
      may block operations on the SPTE for a significant amount of time.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 1af4a960 ("KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed")
      Reported-by: Ignat Korchagin <ignat@cloudflare.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211214033528.123268-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all · 9fb12fe5
      Wei Wang authored
      
      The fixed counter 3 is used for the Topdown metrics, which hasn't been
      enabled for KVM guests. Userspace access to it will fail as it's not
      included in get_fixed_pmc(). This breaks KVM selftests on ICX+ machines,
      which have this counter.
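      Illustratively (toy Python; 0x309-0x30b are the lower fixed-counter MSRs, and 0x30c is the failing MSR from the report below):

```python
MSR_CORE_PERF_FIXED_CTR3 = 0x30C  # fixed counter 3 (Topdown slots)

# Toy subset of msrs_to_save_all: keep fixed counters 0-2, drop counter 3.
msrs_to_save_all = [0x309, 0x30A, 0x30B, MSR_CORE_PERF_FIXED_CTR3]
msrs_to_save_all = [m for m in msrs_to_save_all if m != MSR_CORE_PERF_FIXED_CTR3]
assert MSR_CORE_PERF_FIXED_CTR3 not in msrs_to_save_all
```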
      
      To reproduce it on ICX+ machines, ./state_test reports:
      ==== Test Assertion Failure ====
      lib/x86_64/processor.c:1078: r == nmsrs
      pid=4564 tid=4564 - Argument list too long
      1  0x000000000040b1b9: vcpu_save_state at processor.c:1077
      2  0x0000000000402478: main at state_test.c:209 (discriminator 6)
      3  0x00007fbe21ed5f92: ?? ??:0
      4  0x000000000040264d: _start at ??:?
       Unexpected result from KVM_GET_MSRS, r: 17 (failed MSR was 0x30c)
      
      With this patch, it works well.
      
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Message-Id: <20211217124934.32893-1-wei.w.wang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. Dec 19, 2021
    • KVM: x86: Retry page fault if MMU reload is pending and root has no sp · 18c841e1
      Sean Christopherson authored
      
      Play nice with a NULL shadow page when checking for an obsolete root in
      the page fault handler by flagging the page fault as stale if there's no
      shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending.
      Invalidating memslots, which is the only case where _all_ roots need to
      be reloaded, requests all vCPUs to reload their MMUs while holding
      mmu_lock for write.
      
      The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
      backed by a shadow page.  Running with TDP disabled or with nested NPT
      explodes spectacularly due to dereferencing a NULL shadow page pointer.
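      The resulting check can be sketched like this (toy Python, invented names):

```python
def is_page_fault_stale(root_sp, mmu_reload_pending):
    # A "special" root (e.g. pae_root) has no backing shadow page, so its
    # obsolescence can't be checked; treat the fault as stale iff a full
    # root reload has been requested.
    if root_sp is None:
        return mmu_reload_pending
    return root_sp["obsolete"]

assert is_page_fault_stale(None, mmu_reload_pending=True) is True
assert is_page_fault_stale(None, mmu_reload_pending=False) is False
# With a valid shadow page the reload check is skipped entirely.
assert is_page_fault_stale({"obsolete": False}, mmu_reload_pending=True) is False
```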
      
      Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the
      root.  Zapping shadow pages in response to guest activity, e.g. when the
      guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current
      vCPU isn't using the affected root.  I.e. KVM_REQ_MMU_RELOAD can be seen
      with a completely valid root shadow page.  This is a bit of a moot point
      as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will
      be cleaned up in the future.
      
      Fixes: a955cad8 ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211209060552.2956723-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop guest CPUID check for host initiated writes to MSR_IA32_PERF_CAPABILITIES · 1aa2abb3
      Vitaly Kuznetsov authored
      
      The ability to write to MSR_IA32_PERF_CAPABILITIES from the host should
      not depend on guest visible CPUID entries, even if just to allow
      creating/restoring guest MSRs and CPUIDs in any sequence.
      
      Fixes: 27461da3 ("KVM: x86/pmu: Support full width counting")
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211216165213.338923-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. Dec 10, 2021
    • s390: enable switchdev support in defconfig · 5dcf0c30
      Niklas Schnelle authored
      
      The HiperSockets Converged Interface (HSCI) introduced with commit
      4e20e73e ("s390/qeth: Switchdev event handler") requires
      CONFIG_SWITCHDEV=y to be usable. Similarly when using Linux controlled
      SR-IOV capable PF devices with the mlx5_core driver CONFIG_SWITCHDEV=y
      as well as CONFIG_MLX5_ESWITCH=y are necessary to actually get link on
      the created VFs. So let's add these to the defconfig to make both types
      of devices usable. Note also that these options are already enabled in
      most current distribution kernels.
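      The defconfig additions described here are just the two symbols named above:

```
CONFIG_SWITCHDEV=y
CONFIG_MLX5_ESWITCH=y
```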
      
      Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    • s390/kexec: handle R_390_PLT32DBL rela in arch_kexec_apply_relocations_add() · abf0e8e4
      Alexander Egorenkov authored
      
      Starting with gcc 11.3, the C compiler will generate PLT-relative function
      calls even if they are local and do not require it. Later on during linking,
      the linker will replace all PLT-relative calls to local functions with
      PC-relative ones. Unfortunately, the purgatory code of kexec/kdump is not
      linked the way a regular executable or shared library would be, and
      therefore all PLT-relative addresses remain unresolved in the generated
      purgatory object code. As a result, the purgatory code is executed during
      kdump with all PLT-relative addresses unresolved, which leads to endless
      loops within the purgatory code.
      
      Furthermore, the clang C compiler has always behaved like described above
      and this commit should fix kdump for kernels built with the latter.
      
      Because the purgatory code is no regular executable or shared library,
      contains only calls to local functions and has no PLT, all R_390_PLT32DBL
      relocation entries can be resolved just like a R_390_PC32DBL one.
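      A sketch of that handling (toy Python; the type values 0x13/0x14 match the Info column in the readelf output below, and the address arithmetic is simplified):

```python
R_390_PC32DBL, R_390_PLT32DBL = 0x13, 0x14

def resolve(rtype, sym_addr, place):
    # The purgatory has no PLT and only local calls, so a PLT32DBL entry
    # can be resolved exactly like a PC32DBL one: PC-relative, halved.
    if rtype == R_390_PLT32DBL:
        rtype = R_390_PC32DBL
    assert rtype == R_390_PC32DBL, "unknown relocation type"
    return (sym_addr - place) >> 1

assert resolve(R_390_PLT32DBL, 0x2000, 0x1000) == resolve(R_390_PC32DBL, 0x2000, 0x1000)
```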
      
      * https://refspecs.linuxfoundation.org/ELF/zSeries/lzsabi0_zSeries/x1633.html#AEN1699
      
      Relocation entries of purgatory code generated with gcc 11.3
      ------------------------------------------------------------
      
      $ readelf -r linux/arch/s390/purgatory/purgatory.o
      
      Relocation section '.rela.text' at offset 0x370 contains 5 entries:
        Offset          Info           Type           Sym. Value    Sym. Name + Addend
      00000000005c  000c00000013 R_390_PC32DBL     0000000000000000 purgatory_sha_regions + 2
      00000000007a  000d00000014 R_390_PLT32DBL    0000000000000000 sha256_update + 2
      00000000008c  000e00000014 R_390_PLT32DBL    0000000000000000 sha256_final + 2
      000000000092  000800000013 R_390_PC32DBL     0000000000000000 .LC0 + 2
      0000000000a0  000f00000014 R_390_PLT32DBL    0000000000000000 memcmp + 2
      
      Relocation entries of purgatory code generated with gcc 11.2
      ------------------------------------------------------------
      
      $ readelf -r linux/arch/s390/purgatory/purgatory.o
      
      Relocation section '.rela.text' at offset 0x368 contains 5 entries:
        Offset          Info           Type           Sym. Value    Sym. Name + Addend
      00000000005c  000c00000013 R_390_PC32DBL     0000000000000000 purgatory_sha_regions + 2
      00000000007a  000d00000013 R_390_PC32DBL     0000000000000000 sha256_update + 2
      00000000008c  000e00000013 R_390_PC32DBL     0000000000000000 sha256_final + 2
      000000000092  000800000013 R_390_PC32DBL     0000000000000000 .LC0 + 2
      0000000000a0  000f00000013 R_390_PC32DBL     0000000000000000 memcmp + 2
      
      Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Reported-by: Tao Liu <ltao@redhat.com>
      Suggested-by: Philipp Rudo <prudo@redhat.com>
      Reviewed-by: Philipp Rudo <prudo@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20211209073817.82196-1-egorenar@linux.ibm.com
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    • s390/ftrace: remove preempt_disable()/preempt_enable() pair · ac8fc6af
      Jerome Marchand authored
      
      It looks like commit ce5e4803 ("ftrace: disable preemption
      when recursion locked") missed a spot in kprobe_ftrace_handler() in
      arch/s390/kernel/ftrace.c.
      Remove the superfluous preempt_disable/enable_notrace() there too.
      
      Fixes: ce5e4803 ("ftrace: disable preemption when recursion locked")
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Link: https://lore.kernel.org/r/20211208151503.1510381-1-jmarchan@redhat.com
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    • s390/kexec_file: fix error handling when applying relocations · 41967a37
      Philipp Rudo authored
      
      arch_kexec_apply_relocations_add currently ignores all errors returned
      by arch_kexec_do_relocs. This means that every unknown relocation is
      silently skipped causing unpredictable behavior while the relocated code
      runs. Fix this by checking for errors and fail kexec_file_load if an
      unknown relocation type is encountered.
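      The control-flow change can be sketched as (toy Python with an invented resolver helper):

```python
def do_reloc(rtype):
    # Toy resolver: knows only one type; returns 0 on success, <0 on error.
    return 0 if rtype == "R_390_PC32DBL" else -1

def apply_relocations(relocs):
    for rtype in relocs:
        ret = do_reloc(rtype)
        if ret:          # previously this return value was ignored
            return ret   # now: fail kexec_file_load on unknown relocations
    return 0

assert apply_relocations(["R_390_PC32DBL"]) == 0
assert apply_relocations(["R_390_PLT32DBL"]) == -1  # no longer silently skipped
```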
      
      The problem was found after gcc changed its behavior and used
      R_390_PLT32DBL relocations for brasl instruction and relied on ld to
      resolve the relocations in the final link in case direct calls are
      possible. As the purgatory code is only linked partially (option -r)
      ld didn't resolve the relocations leaving them for arch_kexec_do_relocs.
      But arch_kexec_do_relocs doesn't know how to handle R_390_PLT32DBL
      relocations so they were silently skipped. This ultimately caused an
      endless loop in the purgatory as the brasl instructions kept branching
      to itself.
      
      Fixes: 71406883 ("s390/kexec_file: Add kexec_file_load system call")
      Reported-by: Tao Liu <ltao@redhat.com>
      Signed-off-by: Philipp Rudo <prudo@redhat.com>
      Link: https://lore.kernel.org/r/20211208130741.5821-3-prudo@redhat.com
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    • s390/kexec_file: print some more error messages · edce10ee
      Philipp Rudo authored
      
      Be kind and give some more information on what went wrong.
      
      Signed-off-by: Philipp Rudo <prudo@redhat.com>
      Link: https://lore.kernel.org/r/20211208130741.5821-2-prudo@redhat.com
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    • KVM: x86: Don't WARN if userspace mucks with RCX during string I/O exit · d07898ea
      Sean Christopherson authored
      
      Replace a WARN with a comment to call out that userspace can modify RCX
      during an exit to userspace to handle string I/O.  KVM doesn't actually
      support changing the rep count during an exit, i.e. the scenario can be
      ignored, but the WARN needs to go as it's trivial to trigger from
      userspace.
      
      Cc: stable@vger.kernel.org
      Fixes: 3b27de27 ("KVM: x86: split the two parts of emulator_pio_in")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211025201311.1881846-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Raise #GP when clearing CR0_PG in 64 bit mode · 777ab82d
      Lai Jiangshan authored
      
      Per the SDM:
      
        If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an
        attempt to clear CR0.PG causes a general-protection exception (#GP).
        Software should transition to compatibility mode and clear CR4.PCIDE
        before attempting to disable paging.
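      The quoted rule can be sketched as a check (toy Python, not KVM's actual code):

```python
X86_CR0_PG = 1 << 31
X86_CR4_PCIDE = 1 << 17

def check_cr0_write(old_cr0, new_cr0, cr4, long_mode):
    # #GP if CR0.PG is being cleared while in 64-bit mode or CR4.PCIDE = 1.
    clearing_pg = bool(old_cr0 & X86_CR0_PG) and not (new_cr0 & X86_CR0_PG)
    if clearing_pg and (long_mode or cr4 & X86_CR4_PCIDE):
        return "#GP"
    return "ok"

assert check_cr0_write(X86_CR0_PG, 0, 0, long_mode=True) == "#GP"
assert check_cr0_write(X86_CR0_PG, 0, X86_CR4_PCIDE, long_mode=False) == "#GP"
assert check_cr0_write(X86_CR0_PG, 0, 0, long_mode=False) == "ok"
```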
      
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211207095230.53437-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req · 3244867a
      Sean Christopherson authored
      
      Do not bail early if there are no bits set in the sparse banks for a
      non-sparse, a.k.a. "all CPUs", IPI request.  Per the Hyper-V spec, it is
      legal to have a variable length of '0', e.g. VP_SET's BankContents in
      this case, if the request can be serviced without the extra info.
      
        It is possible that for a given invocation of a hypercall that does
        accept variable sized input headers that all the header input fits
        entirely within the fixed size header. In such cases the variable sized
        input header is zero-sized and the corresponding bits in the hypercall
        input should be set to zero.
      
      Bailing early results in KVM failing to send IPIs to all CPUs as expected
      by the guest.
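The handling described above can be sketched as follows. This is an illustrative model, not KVM's actual `kvm_hv_send_ipi()` implementation; the function name and signature are invented for the sketch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the IPI-request handling described above:
 * an "all CPUs" request carries no sparse-bank payload, so an empty
 * bank list must not cause an early bail. */
static int count_ipi_targets(bool all_cpus, const uint64_t *banks,
                             size_t nbanks, int online_vcpus)
{
    int count = 0;

    if (all_cpus)
        return online_vcpus;  /* ignore the (possibly empty) banks */

    if (nbanks == 0)
        return 0;             /* sparse request with no targets */

    for (size_t i = 0; i < nbanks; i++)
        count += __builtin_popcountll(banks[i]);
    return count;
}
```

The bug was effectively testing `nbanks == 0` before testing `all_cpus`, which turned a legal zero-length variable header into a no-op.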
      
      Fixes: 214ff83d ("KVM: x86: hyperv: implement PV IPI send hypercalls")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211207220926.718794-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3244867a
    • Vitaly Kuznetsov's avatar
      KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall · 1ebfaa11
      Vitaly Kuznetsov authored
      
      Prior to commit 0baedd79 ("KVM: x86: make Hyper-V PV TLB flush use
      tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH |
      KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs
      and KVM_REQ_TLB_FLUSH is/was defined as:
      
       (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
      
      so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that
      "This call guarantees that by the time control returns back to the
      caller, the observable effects of all flushes on the specified virtual
      processors have occurred." and without KVM_REQUEST_WAIT there's a small
      chance that the vCPU making the TLB flush will resume running before
      all IPIs get delivered to other vCPUs and a stale mapping can get read
      there.
      
      Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST:
      kvm_hv_flush_tlb() is the sole caller which uses it for
      kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where
      KVM_REQUEST_WAIT makes a difference.
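The flag composition at issue can be sketched as below. The bit positions here are illustrative stand-ins, not KVM's actual definitions; the point is only that the GUEST variant must carry KVM_REQUEST_WAIT so the requester blocks until every target vCPU has acknowledged the flush:

```c
#include <stdbool.h>

/* Illustrative flag bits (not KVM's real values). */
#define KVM_REQUEST_WAIT      (1U << 8)
#define KVM_REQUEST_NO_WAKEUP (1U << 9)

/* KVM_REQ_TLB_FLUSH always waited; the fix gives the GUEST variant
 * the same KVM_REQUEST_WAIT treatment. */
#define KVM_REQ_TLB_FLUSH       (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
#define KVM_REQ_TLB_FLUSH_GUEST (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

static bool request_waits_for_ack(unsigned int req)
{
    return (req & KVM_REQUEST_WAIT) != 0;
}
```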
      
      Cc: stable@kernel.org
      Fixes: 0baedd79 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211209102937.584397-1-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1ebfaa11
  10. Dec 09, 2021
    • Tiezhu Yang's avatar
      MIPS: Only define pci_remap_iospace() for Ralink · 09d97da6
      Tiezhu Yang authored
      
      After commit 9f76779f ("MIPS: implement architecture-specific
      'pci_remap_iospace()'"), there exists the following warning on the
      Loongson64 platform:
      
          loongson-pci 1a000000.pci:       IO 0x0018020000..0x001803ffff -> 0x0000020000
          loongson-pci 1a000000.pci:      MEM 0x0040000000..0x007fffffff -> 0x0040000000
          ------------[ cut here ]------------
          WARNING: CPU: 2 PID: 1 at arch/mips/pci/pci-generic.c:55 pci_remap_iospace+0x84/0x90
          resource start address is not zero
          ...
          Call Trace:
          [<ffffffff8020dc78>] show_stack+0x40/0x120
          [<ffffffff80cf4a0c>] dump_stack_lvl+0x58/0x74
          [<ffffffff8023a0b0>] __warn+0xe0/0x110
          [<ffffffff80cee02c>] warn_slowpath_fmt+0xa4/0xd0
          [<ffffffff80cecf24>] pci_remap_iospace+0x84/0x90
          [<ffffffff807f9864>] devm_pci_remap_iospace+0x5c/0xb8
          [<ffffffff808121b0>] devm_of_pci_bridge_init+0x178/0x1f8
          [<ffffffff807f4000>] devm_pci_alloc_host_bridge+0x78/0x98
          [<ffffffff80819454>] loongson_pci_probe+0x34/0x160
          [<ffffffff809203cc>] platform_probe+0x6c/0xe0
          [<ffffffff8091d5d4>] really_probe+0xbc/0x340
          [<ffffffff8091d8f0>] __driver_probe_device+0x98/0x110
          [<ffffffff8091d9b8>] driver_probe_device+0x50/0x118
          [<ffffffff8091dea0>] __driver_attach+0x80/0x118
          [<ffffffff8091b280>] bus_for_each_dev+0x80/0xc8
          [<ffffffff8091c6d8>] bus_add_driver+0x130/0x210
          [<ffffffff8091ead4>] driver_register+0x8c/0x150
          [<ffffffff80200a8c>] do_one_initcall+0x54/0x288
          [<ffffffff811a5320>] kernel_init_freeable+0x27c/0x2e4
          [<ffffffff80cfc380>] kernel_init+0x2c/0x134
          [<ffffffff80205a2c>] ret_from_kernel_thread+0x14/0x1c
          ---[ end trace e4a0efe10aa5cce6 ]---
          loongson-pci 1a000000.pci: error -19: failed to map resource [io  0x20000-0x3ffff]
      
      We can see that the resource start address is 0x0000020000, because
      the ISA Bridge used the zero address which is defined in the dts file
      arch/mips/boot/dts/loongson/ls7a-pch.dtsi:
      
          ISA Bridge: /bus@10000000/isa@18000000
          IO 0x0000000018000000..0x000000001801ffff  ->  0x0000000000000000
      
      Based on the above analysis, the architecture-specific pci_remap_iospace()
      is not suitable for Loongson64; given the background of the original
      commit, pci_remap_iospace() should be defined only for Ralink on MIPS.
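The "define the override only for one platform" pattern can be sketched as below. This is an illustrative model only: `CONFIG_SOC_RALINK` is a stand-in macro, not the kernel's actual Kconfig symbol, and the function name is invented for the sketch:

```c
#include <stdint.h>

/* Illustrative sketch: the architecture-specific remap, which assumes
 * the I/O resource starts at address zero, is compiled only for the
 * Ralink stand-in config. */
#ifdef CONFIG_SOC_RALINK
static int arch_pci_remap_iospace(uint64_t res_start)
{
    return res_start == 0 ? 0 : -1; /* Ralink layout: zero start */
}
#define HAVE_ARCH_PCI_REMAP_IOSPACE 1
#endif

#ifndef HAVE_ARCH_PCI_REMAP_IOSPACE
/* Generic fallback used by platforms like Loongson64, whose I/O
 * resource may start at a non-zero address (e.g. 0x20000). */
static int arch_pci_remap_iospace(uint64_t res_start)
{
    (void)res_start; /* generic code handles any start address */
    return 0;
}
#endif
```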
      
      Fixes: 9f76779f ("MIPS: implement architecture-specific 'pci_remap_iospace()'")
      Suggested-by: default avatarThomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: default avatarTiezhu Yang <yangtiezhu@loongson.cn>
      Tested-by: default avatarSergio Paracuellos <sergio.paracuellos@gmail.com>
      Acked-by: default avatarSergio Paracuellos <sergio.paracuellos@gmail.com>
      Signed-off-by: default avatarThomas Bogendoerfer <tsbogend@alpha.franken.de>
      09d97da6
  11. Dec 08, 2021
  12. Dec 07, 2021
  13. Dec 06, 2021
  14. Dec 05, 2021
    • Tom Lendacky's avatar
      x86/sme: Explicitly map new EFI memmap table as encrypted · 1ff2fc02
      Tom Lendacky authored
      Reserving memory using efi_mem_reserve() calls into the x86
      efi_arch_mem_reserve() function. This function will insert a new EFI
      memory descriptor into the EFI memory map representing the area of
      memory to be reserved and marking it as EFI runtime memory. As part
      of adding this new entry, a new EFI memory map is allocated and mapped.
      The mapping is where a problem can occur. This new memory map is mapped
      using early_memremap() and generally mapped encrypted, unless the new
      memory for the mapping happens to come from an area of memory that is
      marked as EFI_BOOT_SERVICES_DATA memory. In this case, the new memory will
      be mapped unencrypted. However, during replacement of the old memory map,
      efi_mem_type() is disabled, so the new memory map will now be long-term
      mapped encrypted (in efi.memmap), resulting in the map containing invalid
      data and causing the kernel boot to crash.
      
      Since it is known that the area will be mapped encrypted going forward,
      explicitly map the new memory map as encrypted using early_memremap_prot().
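The before/after mapping decision can be modeled as below. This is a toy sketch: the names, flag value, and struct are invented for illustration and are not the kernel's early_memremap API:

```c
#include <stdbool.h>
#include <stdint.h>

#define PROT_ENC (1U << 0) /* illustrative "encrypted" protection bit */

typedef struct { uint64_t phys; unsigned prot; } mapping_t;

/* Before the fix: protection was inferred from the memory type, so a
 * memmap allocated from EFI_BOOT_SERVICES_DATA came back unencrypted. */
static mapping_t map_new_memmap_inferred(uint64_t phys,
                                         bool boot_services_data)
{
    return (mapping_t){ phys, boot_services_data ? 0 : PROT_ENC };
}

/* After the fix: the new memmap is known to be encrypted going
 * forward, so map it encrypted explicitly regardless of memory type. */
static mapping_t map_new_memmap_explicit(uint64_t phys)
{
    return (mapping_t){ phys, PROT_ENC };
}
```

The inferred variant is where the boot crash originated: the long-term encrypted view in efi.memmap did not match how the data had been written.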
      
      Cc: <stable@vger.kernel.org> # 4.14.x
      Fixes: 8f716c9b ("x86/mm: Add support to access boot related data in the clear")
      Link: https://lore.kernel.org/all/ebf1eb2940405438a09d51d121ec0d02c8755558.1634752931.git.thomas.lendacky@amd.com/
      
      
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      [ardb: incorporate Kconfig fix by Arnd]
      Signed-off-by: default avatarArd Biesheuvel <ardb@kernel.org>
      1ff2fc02
    • Tom Lendacky's avatar
      KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure · ad5b3532
      Tom Lendacky authored
      
      Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
      exit code or exit parameters fails.
      
      The VMGEXIT instruction can be issued from userspace, even though
      userspace (likely) can't update the GHCB. To prevent userspace from being
      able to kill the guest, return an error through the GHCB when validation
      fails rather than terminating the guest. For cases where the GHCB can't be
      updated (e.g. the GHCB can't be mapped, etc.), just return back to the
      guest.
      
      The new error codes are documented in the latest update to the GHCB
      specification.
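The error-return policy described above can be sketched as a simple decision function. The enum names are illustrative, not the GHCB specification's actual error codes, and the function is a stand-in for KVM's VMGEXIT validation path:

```c
#include <stdbool.h>

enum vmgexit_result {
    VMGEXIT_OK,             /* validation passed, handle the exit */
    VMGEXIT_ERROR_TO_GUEST, /* report an error through the GHCB */
    VMGEXIT_RESUME_GUEST,   /* GHCB unusable: just resume the guest */
};

/* Sketch of the policy: never terminate the guest on validation
 * failure, since userspace can trigger VMGEXIT at will. */
static enum vmgexit_result handle_vmgexit(bool params_valid,
                                          bool ghcb_mapped)
{
    if (!ghcb_mapped)
        return VMGEXIT_RESUME_GUEST;
    if (!params_valid)
        return VMGEXIT_ERROR_TO_GUEST;
    return VMGEXIT_OK;
}
```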
      
      Fixes: 291bd20d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ad5b3532