  1. Jun 18, 2021
    • PCI: Add AMD RS690 quirk to enable 64-bit DMA · cacf994a
      Mikel Rychliski authored
      Although the AMD RS690 chipset has 64-bit DMA support, BIOS implementations
      sometimes fail to configure the memory limit registers correctly.
      
      The Acer F690GVM mainboard uses this chipset and a Marvell 88E8056 NIC. The
      sky2 driver programs the NIC to use 64-bit DMA, which will not work:
      
        sky2 0000:02:00.0: error interrupt status=0x8
        sky2 0000:02:00.0 eth0: tx timeout
        sky2 0000:02:00.0 eth0: transmit ring 0 .. 22 report=0 done=0
      
      Other drivers required by this mainboard either don't support 64-bit DMA,
      or have it disabled using driver specific quirks. For example, the ahci
      driver has quirks to enable or disable 64-bit DMA depending on the BIOS
      version (see ahci_sb600_enable_64bit() in ahci.c). This ahci quirk matches
      against the SB600 SATA controller, but the real issue is almost certainly
      with the RS690 PCI host that it was commonly attached to.
      
      To avoid this issue in all drivers with 64-bit DMA support, fix the
      configuration of the PCI host. If the kernel is aware of physical memory
      above 4GB, but the BIOS never configured the PCI host with this
      information, update the registers with our values.
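
      The fix takes the form of a PCI header fixup quirk. A minimal sketch of
      the shape it can take is below; the register indices, the 0x7910 device
      ID and the programming sequence are assumptions for illustration, not
      authoritative values:

        #include <linux/pci.h>

        /* Hypothetical index/data pair and limit register for the RS690
         * host bridge; illustration only. */
        #define RS690_HTIU_NB_INDEX        0xA8
        #define RS690_HTIU_NB_DATA         0xAC
        #define RS690_LOWER_TOP_OF_DRAM2   0x30

        static void rs690_fix_64bit_dma(struct pci_dev *pdev)
        {
            u32 val = 0;
            phys_addr_t top_of_dram = __pa(high_memory - 1) + 1;

            /* Only act when the kernel knows of RAM above 4GB */
            if (top_of_dram <= (1ULL << 32))
                return;

            /* Read the DRAM limit register through the index/data pair */
            pci_write_config_dword(pdev, RS690_HTIU_NB_INDEX,
                                   RS690_LOWER_TOP_OF_DRAM2);
            pci_read_config_dword(pdev, RS690_HTIU_NB_DATA, &val);

            /* BIOS already configured the limit: nothing to do */
            if (val)
                return;

            pci_info(pdev, "Adjusting top of DRAM to %pa for 64-bit DMA support\n",
                     &top_of_dram);
            /* ...program the upper/lower limit registers with our value... */
        }
        DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, 0x7910, rs690_fix_64bit_dma);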
      
      [bhelgaas: drop PCI_DEVICE_ID_ATI_RS690 definition]
      Link: https://lore.kernel.org/r/20210611214823.4898-1-mikel@mikelr.com
      
      
      Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
  2. Jun 11, 2021
    • x86, lto: Pass -stack-alignment only on LLD < 13.0.0 · 2398ce80
      Tor Vic authored
      Since LLVM commit 3787ee4, the '-stack-alignment' flag has been dropped
      [1], leading to the following error message when building an LTO kernel
      with Clang-13 and LLD-13:
      
          ld.lld: error: -plugin-opt=-: ld.lld: Unknown command line argument
          '-stack-alignment=8'.  Try 'ld.lld --help'
          ld.lld: Did you mean '--stackrealign=8'?
      
      It also appears that the '-code-model' flag is not necessary anymore
      starting with LLVM-9 [2].
      
      Drop '-code-model' and make '-stack-alignment' conditional on LLD < 13.0.0.
      
      These flags were necessary because they were not encoded in the IR
      properly, so the link would restart optimizations without them. Now
      they are properly encoded in the IR, and these flags exposing
      implementation details are no longer necessary.
      
      [1] https://reviews.llvm.org/D103048
      [2] https://reviews.llvm.org/D52322
      
      Cc: stable@vger.kernel.org
      Link: https://github.com/ClangBuiltLinux/linux/issues/1377
      
      
      Signed-off-by: Tor Vic <torvic9@mailbox.org>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/f2c018ee-5999-741e-58d4-e482d5246067@mailbox.org
    • KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU · 654430ef
      Sean Christopherson authored
      
      Calculate and check the full mmu_role when initializing the MMU context
      for the nested MMU, where "full" means the bits and pieces of the role
      that aren't handled by kvm_calc_mmu_role_common().  While the nested MMU
      isn't used for shadow paging, things like the number of levels in the
      guest's page tables are surprisingly important when walking the guest
      page tables.  Failure to reinitialize the nested MMU context if L2's
      paging mode changes can result in unexpected and/or missed page faults,
      and likely other explosions.
      
      E.g. if an L1 vCPU is running both a 32-bit PAE L2 and a 64-bit L2, the
      "common" role calculation will yield the same role for both L2s.  If the
      64-bit L2 is run after the 32-bit PAE L2, L0 will fail to reinitialize
      the nested MMU context, ultimately resulting in a bad walk of L2's page
      tables as the MMU will still have a guest root_level of PT32E_ROOT_LEVEL.
      
        WARNING: CPU: 4 PID: 167334 at arch/x86/kvm/vmx/vmx.c:3075 ept_save_pdptrs+0x15/0xe0 [kvm_intel]
        Modules linked in: kvm_intel
        CPU: 4 PID: 167334 Comm: CPU 3/KVM Not tainted 5.13.0-rc1-d849817d5673-reqs #185
        Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
        RIP: 0010:ept_save_pdptrs+0x15/0xe0 [kvm_intel]
        Code: <0f> 0b c3 f6 87 d8 02 00f
        RSP: 0018:ffffbba702dbba00 EFLAGS: 00010202
        RAX: 0000000000000011 RBX: 0000000000000002 RCX: ffffffff810a2c08
        RDX: ffff91d7bc30acc0 RSI: 0000000000000011 RDI: ffff91d7bc30a600
        RBP: ffff91d7bc30a600 R08: 0000000000000010 R09: 0000000000000007
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff91d7bc30a600
        R13: ffff91d7bc30acc0 R14: ffff91d67c123460 R15: 0000000115d7e005
        FS:  00007fe8e9ffb700(0000) GS:ffff91d90fb00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000029f15a001 CR4: 00000000001726e0
        Call Trace:
         kvm_pdptr_read+0x3a/0x40 [kvm]
         paging64_walk_addr_generic+0x327/0x6a0 [kvm]
         paging64_gva_to_gpa_nested+0x3f/0xb0 [kvm]
         kvm_fetch_guest_virt+0x4c/0xb0 [kvm]
         __do_insn_fetch_bytes+0x11a/0x1f0 [kvm]
         x86_decode_insn+0x787/0x1490 [kvm]
         x86_decode_emulated_instruction+0x58/0x1e0 [kvm]
         x86_emulate_instruction+0x122/0x4f0 [kvm]
         vmx_handle_exit+0x120/0x660 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0xe25/0x1cb0 [kvm]
         kvm_vcpu_ioctl+0x211/0x5a0 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x40/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
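
      The shape of such a fix, sketched with assumed helper names: compute the
      complete role (including guest paging specifics) up front, and skip the
      reinitialization only when that full role matches.

        /* Sketch, not the exact upstream diff. */
        static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
        {
            union kvm_mmu_role new_role = kvm_calc_nested_mmu_role(vcpu);
            struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;

            /* Bail only if the FULL role matches, so a change in L2's
             * paging mode (and thus in root_level) forces a reinit. */
            if (new_role.as_u64 == g_context->mmu_role.as_u64)
                return;

            g_context->mmu_role.as_u64 = new_role.as_u64;
            /* ...reinitialize the walker callbacks and root_level... */
        }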
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: bf627a92 ("x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210610220026.1364486-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix x86_emulator slab cache leak · dfdc0a71
      Wanpeng Li authored
      
      Commit c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      allocates the per-vCPU emulation context dynamically; however, the
      x86_emulator slab cache still exists after the kvm module is unloaded,
      as shown below after destroying the VM and unloading the kvm module.
      
      grep x86_emulator /proc/slabinfo
      x86_emulator          36     36   2672   12    8 : tunables    0    0    0 : slabdata      3      3      0
      
      This patch fixes this slab cache leak by destroying the x86_emulator slab cache
      when the kvm module is unloaded.
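
      The fix itself is essentially one call on the module-unload path; a
      sketch, assuming the cache pointer is named x86_emulator_cache to match
      the slab shown above:

        #include <linux/slab.h>

        static struct kmem_cache *x86_emulator_cache;

        void kvm_arch_exit(void)
        {
            /* ...existing teardown... */
            kmem_cache_destroy(x86_emulator_cache);  /* the missing cleanup */
        }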
      
      Fixes: c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Call SEV Guest Decommission if ASID binding fails · 934002cd
      Alper Gun authored
      
      Send the SEV_CMD_DECOMMISSION command to PSP firmware if ASID binding
      fails. If a failure happens after a successful LAUNCH_START command,
      a decommission command should be executed. Otherwise, the guest context
      is leaked inside the AMD SP, and once the firmware no longer has memory
      to allocate more SEV guest contexts, the LAUNCH_START command will
      begin to fail with the SEV_RET_RESOURCE_LIMIT error.
      
      The existing code calls decommission inside sev_unbind_asid(), but it is
      not called if a failure happens before guest activation succeeds: if
      sev_bind_asid() fails, decommission is never called. PSP firmware has a
      limit on the number of guests. If sev_bind_asid() fails many times, PSP
      firmware will not have the resources to create another guest context.
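
      A sketch of the error path with assumed helper names; the point is that
      a failed activation after LAUNCH_START must be followed by a
      DECOMMISSION for the same handle:

        /* Assumed wrapper around SEV_CMD_DECOMMISSION */
        static void sev_decommission(unsigned int handle)
        {
            struct sev_data_decommission decommission;

            if (!handle)
                return;

            decommission.handle = handle;
            sev_guest_decommission(&decommission, NULL);
        }

        static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error)
        {
            struct sev_data_activate activate;
            int ret;

            activate.handle = handle;
            activate.asid = sev_get_asid(kvm);
            ret = sev_guest_activate(&activate, error);
            if (ret)
                sev_decommission(handle);  /* don't leak the PSP guest context */

            return ret;
        }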
      
      Cc: stable@vger.kernel.org
      Fixes: 59414c98 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command")
      Reported-by: Peter Gonda <pgonda@google.com>
      Signed-off-by: Alper Gun <alpergun@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210610174604.2554090-1-alpergun@google.com>
    • riscv: alternative: fix typo in macro name · 858cf860
      Vitaly Wool authored
      
      alternative-macros.h defines ALT_NEW_CONTENT in its assembly part
      and ALT_NEW_CONSTENT in the C part. Most likely it is the latter
      that is wrong.
      
      Fixes: 6f4eea90 ("riscv: Introduce alternative mechanism to apply errata solution")
      Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • ARC: fix CONFIG_HARDENED_USERCOPY · 110febc0
      Vineet Gupta authored
      Currently enabling this triggers a warning
      
      | usercopy: Kernel memory overwrite attempt detected to kernel text (offset 155633, size 11)!
      | usercopy: BUG: failure at mm/usercopy.c:99/usercopy_abort()!
      |
      |gcc generated __builtin_trap
      |Path: /bin/busybox
      |CPU: 0 PID: 84 Comm: init Not tainted 5.4.22
      |
      |[ECR ]: 0x00090005 => gcc generated __builtin_trap
      |[EFA ]: 0x9024fcaa
      |[BLINK ]: usercopy_abort+0x8a/0x8c
      |[ERET ]: memfd_fcntl+0x0/0x470
      |[STAT32]: 0x80080802 : IE K
      |...
      |...
      |Stack Trace:
      | memfd_fcntl+0x0/0x470
      | usercopy_abort+0x8a/0x8c
      | __check_object_size+0x10e/0x138
      | copy_strings+0x1f4/0x38c
      | __do_execve_file+0x352/0x848
      | EV_Trap+0xcc/0xd0
      
      The issue is triggered by an allocation in the "init reclaimed" region.
      ARC _stext encompasses the init region (for historical reasons we wanted
      init.text to be under .text as well). This however trips up
      __check_object_size()->check_kernel_text_object(), which treats this as
      an object bleeding into kernel text.
      
      Fix that by rezoning _stext to start from the regular kernel .text,
      leaving out .init altogether.
      
      Fixes: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/15
      
      
      Reported-by: Evgeniy Didin <didin@synopsys.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
    • ARCv2: save ABI registers across signal handling · 96f1b001
      Vineet Gupta authored
      ARCv2 has some configuration-dependent registers (r30, r58, r59) which
      could be targeted by the compiler. To keep the ABI stable, these were
      unconditionally made part of the glibc ABI
      (sysdeps/unix/sysv/linux/arc/sys/ucontext.h:mcontext_t); however, we
      missed populating them (by saving/restoring them across signal
      handling).
      
      This patch fixes the issue by
       - adding arcv2 ABI regs to kernel struct sigcontext
       - populating them during signal handling
      
      The change to struct sigcontext might seem like a glibc ABI change
      (although glibc primarily uses ucontext_t:mcontext_t), but the fact is:
       - it has only been extended (existing fields are not touched)
       - the old sigcontext was ABI-incomplete to begin with anyway
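
      A sketch of the extension, with type and field names assumed for
      illustration:

        /* Append-only extension: existing layout stays intact */
        struct user_regs_arcv2 {
            unsigned long r30, r58, r59;
        };

        struct sigcontext {
            struct user_regs_struct regs;
            struct user_regs_arcv2  v2abi;  /* new: saved/restored across signals */
        };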
      
      Fixes: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/53
      
      
      Cc: <stable@vger.kernel.org>
      Tested-by: kernel test robot <lkp@intel.com>
      Reported-by: Vladimir Isaev <isaev@synopsys.com>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
  3. Jun 10, 2021
    • riscv: code patching only works on !XIP_KERNEL · 42e0e0b4
      Jisheng Zhang authored
      
      Some features which need code patching, such as KPROBES, DYNAMIC_FTRACE
      and KGDB, can only work on !XIP_KERNEL. Add dependencies for these
      features that rely on code patching.
      
      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv: xip: support runtime trap patching · 5e63215c
      Vitaly Wool authored
      
      RISCV_ERRATA_ALTERNATIVE patches text at runtime, which is currently
      not possible when the kernel is executed from flash in XIP mode.
      Since runtime patching concerns only traps at the moment, let's just
      have all the traps reside in RAM anyway if RISCV_ERRATA_ALTERNATIVE
      is set. Thus, these functions will be patchable even when the .text
      section is in flash.
      
      Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • KVM: x86: Immediately reset the MMU context when the SMM flag is cleared · 78fcb2c9
      Sean Christopherson authored
      
      Immediately reset the MMU context when the vCPU's SMM flag is cleared so
      that the SMM flag in the MMU role is always synchronized with the vCPU's
      flag.  If RSM fails (which isn't correctly emulated), KVM will bail
      without calling post_leave_smm() and leave the MMU in a bad state.
      
      The bad MMU role can lead to a NULL pointer dereference when grabbing a
      shadow page's rmap for a page fault as the initial lookups for the gfn
      will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
      use the shadow page's SMM flag, which comes from the MMU (=1).  SMM has
      an entirely different set of memslots, and so the initial lookup can find
      a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).
      
        general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
        KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
        CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
        RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
        Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
        RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
        RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
        R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
        R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
        FS:  000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
         mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
         __direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
         direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
         kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
         kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
         vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
         vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
         vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
         kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
         kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:1069 [inline]
         __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x440ce9
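
      The fix itself is small; a sketch, assuming the SMM transition funnels
      through a helper like kvm_smm_changed():

        static void kvm_smm_changed(struct kvm_vcpu *vcpu)
        {
            /* ...existing bookkeeping... */

            /* Resync the MMU role's SMM bit with the vCPU flag right away,
             * so a later RSM emulation failure can't leave them split. */
            kvm_mmu_reset_context(vcpu);
        }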
      
      Cc: stable@vger.kernel.org
      Reported-by: <syzbot+fb0b6a7e8713aeb0319c@syzkaller.appspotmail.com>
      Fixes: 9ec19493 ("KVM: x86: clear SMM flags before loading state while leaving SMM")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Fix fall-through warnings for Clang · 551912d2
      Gustavo A. R. Silva authored
      In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
      of warnings by explicitly adding break statements instead of just letting
      the code fall through to the next case.
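
      The pattern of the fix, as an illustrative example rather than the
      actual hunks:

        switch (index) {
        case 0:
            setup_first();
            break;      /* explicit break instead of implicit fall-through */
        case 1:
            setup_second();
            break;
        default:
            break;
        }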
      
      Link: https://github.com/KSPP/linux/issues/115
      
      
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Message-Id: <20210528200756.GA39320@embeddedor>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: fix doc warnings · 02ffbe63
      ChenXiaoSong authored
      
      Fix kernel-doc warnings:
      
      arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'activate' not described in 'avic_update_access_page'
      arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'kvm' not described in 'avic_update_access_page'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'e' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'kvm' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'svm' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'vcpu_info' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:1009: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
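
      For reference, the kernel-doc format those warnings ask for looks like
      this (the parameter descriptions here are illustrative, not the ones
      that were committed):

        /**
         * avic_update_access_page - enable/disable the AVIC APIC access page
         * @kvm:      the VM being updated
         * @activate: true to activate the access page, false to deactivate
         */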
      
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Message-Id: <20210609122217.2967131-1-chenxiaosong2@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs · a8383dfb
      CodyYao-oc authored
      
      The following commit:
      
         3a4ac121 ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")
      
      Got the old-style NMI watchdog logic wrong and broke it for basically
      every Intel CPU where it was active, which is only truly old CPUs, so
      few people noticed.
      
      On CPUs with perf events support we turn off the old-style NMI watchdog, so it
      was pretty pointless to add the logic for X86_VENDOR_ZHAOXIN to begin with ... :-/
      
      Anyway, the fix is to restore the old logic and add a 'break'.
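
      Schematically, the vendor switch needs its Intel branch restored and
      terminated (a sketch with an illustrative bit calculation, not the
      exact diff):

        static unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
        {
            switch (boot_cpu_data.x86_vendor) {
            case X86_VENDOR_INTEL:
                if (msr >= MSR_ARCH_PERFMON_PERFCTR0)
                    return msr - MSR_ARCH_PERFMON_PERFCTR0;
                break;                          /* the missing 'break' */
            case X86_VENDOR_ZHAOXIN:
            case X86_VENDOR_CENTAUR:
                return msr - MSR_ARCH_PERFMON_PERFCTR0;
            }
            return 0;
        }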
      
      [ mingo: Wrote a new changelog. ]
      
      Fixes: 3a4ac121 ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")
      Signed-off-by: CodyYao-oc <CodyYao-oc@zhaoxin.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210607025335.9643-1-CodyYao-oc@zhaoxin.com
  4. Jun 08, 2021
    • KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync · b53e84ee
      Lai Jiangshan authored
      
      When using shadow paging, unload the guest MMU when emulating a guest TLB
      flush to ensure all roots are synchronized.  From the guest's perspective,
      flushing the TLB ensures any and all modifications to its PTEs will be
      recognized by the CPU.
      
      Note, unloading the MMU is overkill, but is done to mirror KVM's existing
      handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
      can be done to more precisely synchronize roots when servicing a guest
      TLB flush.
      
      If TDP is enabled, synchronizing the MMU is unnecessary even if nested
      TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
      TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
      guest-physical mappings; for NPT, guest mappings are always tagged with
      an ASID and thus can only be invalidated via the VMCB's ASID control.
      
      This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
      It was only recently exposed after Linux guests stopped flushing the
      local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eab,
      "x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
      visible in Windows 10 guests.
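
      A sketch of the resulting logic around the existing
      kvm_vcpu_flush_tlb_guest() (comment wording assumed):

        static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
        {
            ++vcpu->stat.tlb_flush;

            if (!tdp_enabled) {
                /* Shadow paging: a guest TLB flush must also resync the
                 * shadow roots; unloading the MMU is a big hammer that
                 * guarantees it. */
                kvm_mmu_unload(vcpu);
                return;
            }

            static_call(kvm_x86_tlb_flush_guest)(vcpu);
        }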
      
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: f38a7b75 ("KVM: X86: support paravirtualized help for TLB shootdowns")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      [sean: massaged comment and changelog]
      Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message · f31500b0
      Sean Christopherson authored
      
      Use the __string() machinery provided by the tracing subsystem to make a
      copy of the string literals consumed by the "nested VM-Enter failed"
      tracepoint.  A complete copy is necessary to ensure that the tracepoint
      can't outlive the data/memory it consumes and dereference stale memory.
      
      Because the tracepoint itself is defined by kvm, if kvm-intel and/or
      kvm-amd are built as modules, the memory holding the string literals
      defined by the vendor modules will be freed when the module is unloaded,
      whereas the tracepoint and its data in the ring buffer will live until
      kvm is unloaded (or "indefinitely" if kvm is built-in).
      
      This bug has existed since the tracepoint was added, but was recently
      exposed by a new check in tracing to detect exactly this type of bug.
      
        fmt: '%s%s
        ' current_buffer: ' vmx_dirty_log_t-140127  [003] ....  kvm_nested_vmenter_failed: '
        WARNING: CPU: 3 PID: 140134 at kernel/trace/trace.c:3759 trace_check_vprintf+0x3be/0x3e0
        CPU: 3 PID: 140134 Comm: less Not tainted 5.13.0-rc1-ce2e73ce600a-req #184
        Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
        RIP: 0010:trace_check_vprintf+0x3be/0x3e0
        Code: <0f> 0b 44 8b 4c 24 1c e9 a9 fe ff ff c6 44 02 ff 00 49 8b 97 b0 20
        RSP: 0018:ffffa895cc37bcb0 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffffa895cc37bd08 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff9766cfad74f8
        RBP: ffffffffc0a041d4 R08: ffff9766cfad74f0 R09: ffffa895cc37bad8
        R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffc0a041d4
        R13: ffffffffc0f4dba8 R14: 0000000000000000 R15: ffff976409f2c000
        FS:  00007f92fa200740(0000) GS:ffff9766cfac0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559bd11b0000 CR3: 000000019fbaa002 CR4: 00000000001726e0
        Call Trace:
         trace_event_printf+0x5e/0x80
         trace_raw_output_kvm_nested_vmenter_failed+0x3a/0x60 [kvm]
         print_trace_line+0x1dd/0x4e0
         s_show+0x45/0x150
         seq_read_iter+0x2d5/0x4c0
         seq_read+0x106/0x150
         vfs_read+0x98/0x180
         ksys_read+0x5f/0xe0
         do_syscall_64+0x40/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
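
      For context, the __string()/__assign_str() pattern copies the string
      into the ring-buffer entry itself; an illustrative TRACE_EVENT sketch
      (not the tracepoint's real prototype):

        TRACE_EVENT(kvm_nested_vmenter_failed,
            TP_PROTO(const char *msg, u32 err),
            TP_ARGS(msg, err),

            TP_STRUCT__entry(
                __string(msg, msg)      /* copies; doesn't keep the pointer */
                __field(u32, err)
            ),

            TP_fast_assign(
                __assign_str(msg, msg);
                __entry->err = err;
            ),

            TP_printk("%s err=%u", __get_str(msg), __entry->err)
        );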
      
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Fixes: 380e0055 ("KVM: nVMX: trace nested VM-Enter failures detected by H/W")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Message-Id: <20210607175748.674002-1-seanjc@google.com>
    • KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior · af3511ff
      Lai Jiangshan authored
      
      In record_steal_time(), st->preempted is read twice, and
      trace_kvm_pv_tlb_flush() might output an inconsistent result if
      kvm_vcpu_flush_tlb_guest() sees a different st->preempted later.
      
      It is a very trivial problem that hardly causes actual harm, and it can
      be avoided by resetting and reading st->preempted atomically via xchg().
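
      A sketch of the idea (surrounding record_steal_time() code assumed):

        /* Read-and-clear st->preempted exactly once, then use the same
         * snapshot for both the tracepoint and the flush decision. */
        u8 st_preempted = xchg(&st->preempted, 0);

        trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
                               st_preempted & KVM_VCPU_FLUSH_TLB);
        if (st_preempted & KVM_VCPU_FLUSH_TLB)
            kvm_vcpu_flush_tlb_guest(vcpu);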
      
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      
      Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: MMU: Use the correct inherited permissions to get shadow page · b1bd5cba
      Lai Jiangshan authored
      When computing the access permissions of a shadow page, use the effective
      permissions of the walk up to that point, i.e. the logical AND of its
      parents' permissions.  Two guest PxE entries that point at the same table gfn need to
      be shadowed with different shadow pages if their parents' permissions are
      different.  KVM currently uses the effective permissions of the last
      non-leaf entry for all non-leaf entries.  Because all non-leaf SPTEs have
      full ("uwx") permissions, and the effective permissions are recorded only
      in role.access and merged into the leaves, this can lead to incorrect
      reuse of a shadow page and eventually to a missing guest protection page
      fault.
      
      For example, here is a shared pagetable:
      
         pgd[]   pud[]        pmd[]            virtual address pointers
                           /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
              /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
         pgd-|           (shared pmd[] as above)
              \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                           \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
      
        pud1 and pud2 point to the same pmd table, so:
        - ptr1 and ptr3 point to the same page.
        - ptr2 and ptr4 point to the same page.
      
      (pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)
      
      - First, the guest reads from ptr1 and KVM prepares a shadow
        page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
        "u--" comes from the effective permissions of pgd, pud1 and
        pmd1, which are stored in pt->access.  "u--" is used also to get
        the pagetable for pud1, instead of "uw-".
      
      - Then the guest writes to ptr2 and KVM reuses pud1, which is present.
        The hypervisor sets up a shadow page for ptr2 with pt->access = "uw-"
        even though pud1's pmd (because of the incorrect argument to
        kvm_mmu_get_page in the previous step) has role.access="u--".
      
      - Then the guest reads from ptr3.  The hypervisor reuses pud1's
        shadow pmd for pud2, because both use "u--" for their permissions.
        Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
      
      - At last, the guest writes to ptr4.  This causes no vmexit or pagefault,
        because pud1's shadow page structures included an "uw-" page even though
        its role.access was "u--".
      
      Any kind of shared pagetable might have a similar problem in a virtual
      machine without TDP enabled, if the permissions differ between
      ancestors.
      
      In order to fix the problem, we change pt->access to be an array, and
      any access in it will not include permissions ANDed from child ptes.
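
      Conceptually, the guest-walker change looks like this (names simplified
      from paging_tmpl.h):

        struct guest_walker {
            int level;
            /* Effective permissions at EACH level of the walk, instead of
             * a single value that conflated the last non-leaf entry with
             * its ancestors. */
            unsigned int pt_access[PT_MAX_FULL_LEVELS];
            unsigned int pte_access;
            /* ... */
        };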
      
      The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
      
      
      Remember to test it with TDP disabled.
      
      The problem had existed long before commit 41074d07 ("KVM: MMU:
      Fix inherited permissions for emulated guest pte updates"), and it
      is hard to find the culprit, so there is no precise fixes tag here.
      
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210603052455.21023-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      Fixes: cea0f0e7 ("[PATCH] KVM: MMU: Shadow page table caching")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer · e898da78
      Wanpeng Li authored
      
      According to the SDM 10.5.4.1:
      
        A write of 0 to the initial-count register effectively stops the local
        APIC timer, in both one-shot and periodic mode.
      
      However, the LAPIC timer in oneshot/periodic mode, as emulated by the
      vmx-preemption timer, doesn't stop on a write of 0 to TMICT, since
      vmx->hv_deadline_tsc is still programmed and the guest will receive a
      spurious timer interrupt later. This patch fixes it by also cancelling
      the vmx-preemption timer when 0 is written to the initial-count register.
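
      A sketch of the write path, with assumed helper names for the timer
      cancellation:

        /* Illustrative handling of a write to APIC_TMICT */
        static void lapic_write_tmict(struct kvm_lapic *apic, u32 val)
        {
            if (!val) {
                /* SDM 10.5.4.1: writing 0 stops the timer, so also tear
                 * down the hrtimer or VMX preemption timer backing the
                 * emulated LAPIC timer (clears vmx->hv_deadline_tsc). */
                cancel_apic_timer(apic);
                return;
            }

            kvm_lapic_set_reg(apic, APIC_TMICT, val);
            start_apic_timer(apic);
        }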
      
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623050385-100988-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca82 · 4f13d471
      Ashish Kalra authored
      
      Commit 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack")
      uses the local stack to allocate the structures used to communicate with
      the PSP, which were previously kzalloc'ed. This breaks SEV live migration
      when computing the SEND_START session length and the SEND_UPDATE_DATA
      query length, as the session_len field (respectively the trans_len and
      hdr_len fields) is not zeroed for these commands before the SEV firmware
      API call is issued; hence the firmware returns an incorrect session
      length and update-data header or trans length.
      
      Also, the SEV firmware API returns the SEV_RET_INVALID_LEN firmware
      error for these length-query API calls, and the return value and the
      firmware error need to be passed to userspace as-is, so the return
      check in the KVM code needs to be removed.
      
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <20210607061532.27459-1-Ashish.Kalra@amd.com>
      Fixes: 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. Jun 03, 2021
    • x86/setup: Always reserve the first 1M of RAM · f1d4d47c
      Mike Rapoport authored
      
      There are BIOSes that are known to corrupt the memory under 1M, or more
      precisely under 640K because the memory above 640K is anyway reserved
      for the EGA/VGA frame buffer and BIOS.
      
      To prevent the kernel from using memory that may be clobbered by the
      BIOS, the beginning of the memory is always reserved. The exact size
      of the reserved area is determined at build time by
      CONFIG_X86_RESERVE_LOW and at run time by the "reservelow=" command
      line option. The reserved range may be from 4K to 640K with a default
      of 64K. There are also configurations that reserve the entire 1M range,
      like machines with SandyBridge graphics devices or systems that enable
      a crash kernel.
      
      In addition to the potentially clobbered memory, EBDA of unknown size may
      be as low as 128K and the memory above that EBDA start is also reserved
      early.
      
      It would have been possible to reserve the entire range under 1M, were
      it not for the real mode trampoline that must reside in that area.
      
      To accommodate placement of the real mode trampoline and keep the memory
      safe from being clobbered by BIOS, reserve the first 64K of RAM before
      memory allocations are possible and then, after the real mode trampoline
      is allocated, reserve the entire range from 0 to 1M.
      
      Update trim_snb_memory() and reserve_real_mode() to avoid redundant
      reservations of the same memory range.
      
      Also make sure the memory under 1M is not getting freed by
      efi_free_boot_services().
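
      A sketch of the two reservation steps (placement of the calls is
      assumed):

        /* Very early, before any memblock allocations can happen: */
        memblock_reserve(0, SZ_64K);

        /* Later, once the real mode trampoline has been allocated from
         * low memory, extend the reservation to the whole first 1M: */
        memblock_reserve(0, SZ_1M);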
      
       [ bp: Massage commit message and comments. ]
      
      Fixes: a799c2bd ("x86/setup: Consolidate early memory reservations")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Tested-by: Hugh Dickins <hughd@google.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213177
      Link: https://lkml.kernel.org/r/20210601075354.5149-2-rppt@kernel.org
    • x86/alternative: Optimize single-byte NOPs at an arbitrary position · 2b31e8ed
      Borislav Petkov authored
      
      Up until now, the assumption was that an alternative patching site would
      have some instructions at the beginning and trailing single-byte NOP
      (0x90) padding. Therefore, the patching machinery would go and optimize
      those single-byte NOPs into longer ones.
      
      However, this assumption is broken on 32-bit when code like
      hv_do_hypercall() in hyperv_init() uses the retpoline speculation
      killer CALL_NOSPEC. The 32-bit version of that macro aligns certain
      insns to 16 bytes, leading to the compiler issuing one or more
      single-byte NOPs, depending on the holes it needs to fill for alignment.
      
      That would cause the warning in optimize_nops() to fire:
      
        ------------[ cut here ]------------
        Not a NOP at 0xc27fb598
         WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:211 optimize_nops.isra.13
      
      due to that function verifying whether all of the following bytes really
      are single-byte NOPs.
      
      Therefore, carve out the NOP padding into a separate function and call
      it for each NOP range beginning with a single-byte NOP.
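
      Schematically (helper name assumed), the scan becomes:

        static void optimize_nop_ranges(u8 *instr, unsigned int len)
        {
            unsigned int i, start;

            for (i = 0; i < len; i++) {
                if (instr[i] != 0x90)   /* 0x90 = single-byte NOP */
                    continue;

                start = i;
                while (i < len && instr[i] == 0x90)
                    i++;

                /* replace this run with optimal longer NOPs */
                add_nops(instr + start, i - start);
            }
        }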
      
      Fixes: 23c1ad53 ("x86/alternatives: Optimize optimize_nops()")
      Reported-by: Richard Narron <richard@aaazen.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213301
      Link: https://lkml.kernel.org/r/20210601212125.17145-1-bp@alien8.de
    • x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid() · 9bfecd05
      Thomas Gleixner authored
      
      While digesting the XSAVE-related horrors which got introduced with
      the supervisor/user split, the recent addition of ENQCMD-related
      functionality got on the radar and turned out to be similarly broken.
      
      update_pasid(), which is only required when X86_FEATURE_ENQCMD is
      available, is invoked from two places:
      
       1) From switch_to() for the incoming task
      
       2) Via an SMP function call from the IOMMU/SVM code
      
      #1 is halfway correct as it hacks around the brokenness of get_xsave_addr()
         by enforcing the state to be 'present', but all the conditionals in that
         code are completely pointless for that.
      
         Also the invocation is just useless overhead because at that point
         it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
         and all of this can be handled at return to user space.
      
      #2 is broken beyond repair. The comment in the code claims that it is safe
         to invoke this in an IPI, but that's just wishful thinking.
      
         FPU state of a running task is protected by fregs_lock() which is
         nothing else than a local_bh_disable(). As BH-disabled regions usually
         run with interrupts enabled, the IPI can hit a code section which
         modifies FPU state, and there is absolutely no guarantee that any of
         the assumptions made for the IPI case are true.
      
         Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
         invoked with a NULL pointer argument, so it can hit a completely
         unrelated task and unconditionally force an update for nothing.
         Worse, it can hit a kernel thread which operates on a user space
         address space and set a random PASID for it.
      
      The offending commit does not cleanly revert, but it's sufficient to
      force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
      code to make this dysfunctional all over the place. Anything more
      complex would require more surgery and none of the related functions
      outside of the x86 core code are blatantly wrong, so removing those
      would be overkill.
      
      As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
      required to make this actually work, this cannot result in a regression
      except for related out of tree train-wrecks, but they are broken already
      today.
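
      The force-disable half is a one-liner in early CPU setup (exact
      placement assumed):

        /* Keep ENQCMD off until the PASID plumbing is correct; nothing
         * sets the PASID bit in IA32_XSS yet anyway. */
        setup_clear_cpu_cap(X86_FEATURE_ENQCMD);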
      
      Fixes: 20f0afd1 ("x86/mmu: Allocate/free a PASID")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
  6. Jun 02, 2021
    • ARM: cpuidle: Avoid orphan section warning · d94b93a9
      Arnd Bergmann authored
      
      Since commit 83109d5d ("x86/build: Warn on orphan section placement"),
      we get a warning for objects in orphan sections. The cpuidle implementation
      for OMAP causes this when CONFIG_CPU_IDLE is disabled:
      
      arm-linux-gnueabi-ld: warning: orphan section `__cpuidle_method_of_table' from `arch/arm/mach-omap2/pm33xx-core.o' being placed in section `__cpuidle_method_of_table'
      arm-linux-gnueabi-ld: warning: orphan section `__cpuidle_method_of_table' from `arch/arm/mach-omap2/pm33xx-core.o' being placed in section `__cpuidle_method_of_table'
      arm-linux-gnueabi-ld: warning: orphan section `__cpuidle_method_of_table' from `arch/arm/mach-omap2/pm33xx-core.o' being placed in section `__cpuidle_method_of_table'
      
      Change the definition of CPUIDLE_METHOD_OF_DECLARE() to silently
      drop the table and all code referenced from it when CONFIG_CPU_IDLE
      is disabled.
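
      A sketch of that definition pattern (struct and section names abridged
      from arch/arm/include/asm/cpuidle.h):

        #ifdef CONFIG_CPU_IDLE
        #define CPUIDLE_METHOD_OF_DECLARE(name, method, ops)                  \
            static const struct of_cpuidle_method __cpuidle_method_of_table_##name \
            __used __section("__cpuidle_method_of_table")                     \
            = { .method = method, .ops = ops }
        #else
        /* Without CONFIG_CPU_IDLE, keep the symbol out of any section so
         * the table entry and the code it references are silently dropped. */
        #define CPUIDLE_METHOD_OF_DECLARE(name, method, ops)                  \
            static const struct of_cpuidle_method __cpuidle_method_of_table_##name \
            __maybe_unused = { .method = method, .ops = ops }
        #endif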
      
      Fixes: 06ee7a95 ("ARM: OMAP2+: pm33xx-core: Add cpuidle_ops for am335x/am437x")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20201230155506.1085689-1-arnd@kernel.org
    • RISC-V: Fix memblock_free() usages in init_resources() · da2d4880
      Wende Tan authored
      
      `memblock_free()` takes a physical address as its first argument.
      Fix the wrong usages in `init_resources()`.
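
      The pattern of the fix (illustrative fragment): memblock_alloc()
      returns a virtual address, while memblock_free() at this point in time
      expects a physical one.

        struct resource *res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES);

        /* on an error path, pass the PHYSICAL address back: */
        memblock_free(__pa(res), sizeof(*res));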
      
      Fixes: ffe0e526 ("RISC-V: Improve init_resources()")
      Fixes: 797f0375 ("RISC-V: Do not allocate memblock while iterating reserved memblocks")
      Signed-off-by: Wende Tan <twd2.me@gmail.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv: skip errata_cip_453.o if CONFIG_ERRATA_SIFIVE_CIP_453 is disabled · b75db25c
      Vincent authored
      
      errata_cip_453.o should be built only when the Kconfig option
      CONFIG_ERRATA_SIFIVE_CIP_453 is enabled.
      
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Vincent <vincent.chen@sifive.com>
      Fixes: 0e0d4992 ("riscv: enable SiFive errata CIP-453 and CIP-1200 Kconfig only if CONFIG_64BIT=y")
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv: mm: Fix W+X mappings at boot · 8a4102a0
      Jisheng Zhang authored
      
      When the kernel mapping was moved to the last 2GB of the address space,
      (__va(PFN_PHYS(max_low_pfn))) became much smaller than the .data section
      start address, so the last set_memory_nx() in protect_kernel_text_data()
      fails and the .data section is still mapped W+X. This results in the
      W+X mapping warning below at boot. Fix it by passing the correct number
      of .data section pages to set_memory_nx().
      
      [    0.396516] ------------[ cut here ]------------
      [    0.396889] riscv/mm: Found insecure W+X mapping at address (____ptrval____)/0xffffffff80c00000
      [    0.398347] WARNING: CPU: 0 PID: 1 at arch/riscv/mm/ptdump.c:258 note_page+0x244/0x24a
      [    0.398964] Modules linked in:
      [    0.399459] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc1+ #14
      [    0.400003] Hardware name: riscv-virtio,qemu (DT)
      [    0.400591] epc : note_page+0x244/0x24a
      [    0.401368]  ra : note_page+0x244/0x24a
      [    0.401772] epc : ffffffff80007c86 ra : ffffffff80007c86 sp : ffffffe000e7bc30
      [    0.402304]  gp : ffffffff80caae88 tp : ffffffe000e70000 t0 : ffffffff80cb80cf
      [    0.402800]  t1 : ffffffff80cb80c0 t2 : 0000000000000000 s0 : ffffffe000e7bc80
      [    0.403310]  s1 : ffffffe000e7bde8 a0 : 0000000000000053 a1 : ffffffff80c83ff0
      [    0.403805]  a2 : 0000000000000010 a3 : 0000000000000000 a4 : 6c7e7a5137233100
      [    0.404298]  a5 : 6c7e7a5137233100 a6 : 0000000000000030 a7 : ffffffffffffffff
      [    0.404849]  s2 : ffffffff80e00000 s3 : 0000000040000000 s4 : 0000000000000000
      [    0.405393]  s5 : 0000000000000000 s6 : 0000000000000003 s7 : ffffffe000e7bd48
      [    0.405935]  s8 : ffffffff81000000 s9 : ffffffffc0000000 s10: ffffffe000e7bd48
      [    0.406476]  s11: 0000000000001000 t3 : 0000000000000072 t4 : ffffffffffffffff
      [    0.407016]  t5 : 0000000000000002 t6 : ffffffe000e7b978
      [    0.407435] status: 0000000000000120 badaddr: 0000000000000000 cause: 0000000000000003
      [    0.408052] Call Trace:
      [    0.408343] [<ffffffff80007c86>] note_page+0x244/0x24a
      [    0.408855] [<ffffffff8010c5a6>] ptdump_hole+0x14/0x1e
      [    0.409263] [<ffffffff800f65c6>] walk_pgd_range+0x2a0/0x376
      [    0.409690] [<ffffffff800f6828>] walk_page_range_novma+0x4e/0x6e
      [    0.410146] [<ffffffff8010c5f8>] ptdump_walk_pgd+0x48/0x78
      [    0.410570] [<ffffffff80007d66>] ptdump_check_wx+0xb4/0xf8
      [    0.410990] [<ffffffff80006738>] mark_rodata_ro+0x26/0x2e
      [    0.411407] [<ffffffff8031961e>] kernel_init+0x44/0x108
      [    0.411814] [<ffffffff80002312>] ret_from_exception+0x0/0xc
      [    0.412309] ---[ end trace 7ec3459f2547ea83 ]---
      [    0.413141] Checked W+X mappings: failed, 512 W+X pages found
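
      A sketch of the corrected call, with assumed section symbols; the page
      count must come from the .data section bounds, not from max_low_pfn:

        extern char _sdata[], _end[];

        unsigned long data_start = (unsigned long)_sdata;
        unsigned long data_end = (unsigned long)_end;

        set_memory_nx(data_start, (data_end - data_start) >> PAGE_SHIFT);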
      
      Fixes: 2bfc6cd8 ("riscv: Move kernel mapping outside of linear mapping")
      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
  7. May 31, 2021
    • x86/thermal: Fix LVT thermal setup for SMI delivery mode · 9a90ed06
      Borislav Petkov authored
      
      There are machines out there with added value crap^WBIOS which provide an
      SMI handler for the local APIC thermal sensor interrupt. Out of reset,
      the BSP on those machines has something like 0x200 in that APIC register
      (timestamps left in because this whole issue is timing sensitive):
      
        [    0.033858] read lvtthmr: 0x330, val: 0x200
      
      which means:
      
       - bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
       - bits [10:8] have 010b which means SMI delivery mode.
      
      Now, later during boot, when the kernel programs the local APIC, it
      soft-disables it temporarily through the spurious vector register:
      
        setup_local_APIC:
      
        	...
      
      	/*
      	 * If this comes from kexec/kcrash the APIC might be enabled in
      	 * SPIV. Soft disable it before doing further initialization.
      	 */
      	value = apic_read(APIC_SPIV);
      	value &= ~APIC_SPIV_APIC_ENABLED;
      	apic_write(APIC_SPIV, value);
      
      which means (from the SDM):
      
      "10.4.7.2 Local APIC State After It Has Been Software Disabled
      
      ...
      
      * The mask bits for all the LVT entries are set. Attempts to reset these
      bits will be ignored."
      
      And this happens too:
      
        [    0.124111] APIC: Switch to symmetric I/O mode setup
        [    0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
        [    0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
      
      This results in CPU 0 soft lockups depending on the placement in time
      when the APIC soft-disable happens. Those soft lockups are not 100%
      reproducible and the reason for that can only be speculated as no one
      tells you what SMM does. Likely, it confuses the SMM code that the APIC
      is disabled and the thermal interrupt doesn't fire at all,
      leading to CPU 0 stuck in SMM forever...
      
      Now, before
      
        4f432e8b ("x86/mce: Get rid of mcheck_intel_therm_init()")
      
      due to how the APIC_LVTTHMR was read before APIC initialization in
      mcheck_intel_therm_init(), it would read the value with the mask bit 16
      clear and then intel_init_thermal() would replicate it onto the APs and
      all would be peachy - the thermal interrupt would remain enabled.
      
      But that commit moved that reading to a later moment in
      intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
      late and with its interrupt mask bit set.
      
      Thus, revert back to the old behavior of reading the thermal LVT
      register before the APIC gets initialized.
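
      A sketch of the restored early read (variable and function names as in
      the old code; exact placement assumed):

        static u32 lvtthmr_init __read_mostly;

        /* Runs on the BSP before setup_local_APIC() soft-disables the
         * APIC and forces the LVT mask bits on. */
        void __init mcheck_intel_therm_init(void)
        {
            if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
                lvtthmr_init = apic_read(APIC_LVTTHMR);
        }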
      
      Fixes: 4f432e8b ("x86/mce: Get rid of mcheck_intel_therm_init()")
      Reported-by: James Feeney <james@nurealm.net>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
    • perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1 · 4a0e3ff3
      Kan Liang authored
      
      A kernel WARNING may be triggered when setting maxcpus=1.
      
      The uncore counters are Die-scope. When probing a PCI device, only the
      BUS information can be retrieved. The uncore driver has to maintain a
      mapping table used to calculate the logical Die ID from a given BUS#.
      
      Before the patch ba9506be, the mapping table stored the mapping
      from a BUS# to a Physical Socket ID. To calculate the logical Die ID,
      perf does:
      - In snbep_pci2phy_map_init(), retrieve the BUS# -> Physical Socket ID
        mapping from the UBOX PCI config space.
      - Calculate the mapping (BUS# -> Physical Socket ID) for the other PCI
        buses.
      - In uncore_pci_probe(), get the Physical Socket ID from a given BUS
        and the mapping table.
      - Calculate the logical Die ID.
      
      Since only the logical Die ID is required, with the patch ba9506be
      the mapping table stores the mapping from a BUS# to a logical Die ID.
      Now perf does:
      - In snbep_pci2phy_map_init(), retrieve the BUS# -> Physical Socket ID
        mapping from the UBOX PCI config space.
      - Calculate the logical Die ID.
      - Calculate the mapping (BUS# -> logical Die ID) for the other PCI
        buses.
      - In uncore_pci_probe(), get the logical Die ID from a given BUS and
        the mapping table.
      
      When calculating the logical Die ID, -1 may be returned, especially
      when maxcpus=1. Here, -1 means the logical Die ID was not found. But
      when calculating the mapping for the other PCI buses, -1 indicates
      that it's another PCI bus that requires the calculation of the
      mapping, so the driver will mistakenly do the calculation.
      
      Use -ENODEV to indicate the case in which the logical Die ID is not
      found, so the driver will not mess up the mapping table anymore.
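
      Schematically (function shape assumed):

        static int uncore_device_to_die(struct pci_dev *dev)
        {
            int node = pcibus_to_node(dev->bus);
            int cpu;

            for_each_cpu(cpu, cpumask_of_pcibus(dev->bus)) {
                struct cpuinfo_x86 *c = &cpu_data(cpu);

                if (c->initialized && cpu_to_node(cpu) == node)
                    return c->logical_die_id;
            }

            /* was -1, which collided with the "needs mapping" sentinel */
            return -ENODEV;
        }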
      
      Fixes: ba9506be ("perf/x86/intel/uncore: Store the logical die id instead of the physical die id.")
      Reported-by: John Donnelly <john.p.donnelly@oracle.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Tested-by: John Donnelly <john.p.donnelly@oracle.com>
      Link: https://lkml.kernel.org/r/1622037527-156028-1-git-send-email-kan.liang@linux.intel.com
    • arm64: meson: select COMMON_CLK · 4cce442f
      Jerome Brunet authored
      
      This fixes the recent removal of the clock drivers selection. While it
      is not necessary to select the clock drivers themselves, we need to
      select a proper implementation of the clock API, which for Meson is the
      Common Clock Framework (CCF).
      
      Fixes: ba66a255 ("arm64: meson: ship only the necessary clock controllers")
      Reviewed-by: Neil Armstrong <narmstrong@baylibre.com>
      Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
      Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Kevin Hilman <khilman@baylibre.com>
      Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
      Link: https://lore.kernel.org/r/20210429083823.59546-1-jbrunet@baylibre.com
  8. May 29, 2021
    • riscv: Use -mno-relax when using lld linker · ec3a5cb6
      Khem Raj authored
      
      lld does not implement the RISC-V relaxation optimizations like GNU ld
      does, therefore disable relaxation when building with lld. Also pass
      the option to the assembler when using an external GNU assembler
      (LLVM_IAS != 1); this ensures that the relevant assembler option is
      enabled as well. If these options are not used, we see the following
      relocations in objects:
      
      0000000000000000 R_RISCV_ALIGN     *ABS*+0x0000000000000002
      
      These are then rejected by lld, even though the .o is already compiled
      with -mno-relax:
      
      ld.lld: error: capability.c:(.fixup+0x0): relocation R_RISCV_ALIGN requires unimplemented linker relaxation; recompile with -mno-relax
      
      Signed-off-by: Khem Raj <raj.khem@gmail.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing · 7d65f9e8
      Thomas Gleixner authored
      
      PIC interrupts do not support affinity setting and they can end up on
      any online CPU. Therefore, it's required to mark the associated vectors
      as system-wide reserved. Otherwise, the corresponding irq descriptors
      are copied to the secondary CPUs but the vectors are not marked as
      assigned or reserved. This works correctly for the IO/APIC case.
      
      When the IO/APIC is disabled via config, kernel command line or lack of
      enumeration then all legacy interrupts are routed through the PIC, but
      nothing marks them as system-wide reserved vectors.
      
      As a consequence, a subsequent allocation on a secondary CPU can result in
      allocating one of these vectors, which triggers the BUG() in
      apic_update_vector() because the interrupt descriptor slot is not empty.
      
      Imran tried to work around that by marking those interrupts as allocated
      when a CPU comes online. But that's wrong in case the IO/APIC is
      available and one of the legacy interrupts, e.g. IRQ0, has been switched
      to PIC mode, because then marking them as allocated will fail as they
      are already marked as system vectors.
      
      Stay consistent and update the legacy vectors after attempting IO/APIC
      initialization and mark them as system vectors in case that no IO/APIC is
      available.
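
      A sketch of the resulting flow (helper name assumed):

        /* After IO/APIC setup has had its chance: if no IO/APIC took over
         * the legacy interrupts, mark their vectors as system vectors so
         * the matrix allocator never hands them out on secondary CPUs. */
        void __init lapic_update_legacy_vectors(void)
        {
            unsigned int i;

            if (IS_ENABLED(CONFIG_X86_IO_APIC) && nr_ioapics > 0)
                return;

            for (i = 0; i < nr_legacy_irqs(); i++)
                lapic_assign_legacy_vector(i, true);
        }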
      
      Fixes: 69cde000 ("x86/vector: Use matrix allocator for vector assignment")
      Reported-by: Imran Khan <imran.f.khan@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com