- Dec 11, 2020
-
-
Giovanni Gherdovich authored
The value freq_max/freq_base is a fundamental component of frequency invariance calculations. It may come from a variety of sources such as MSRs or ACPI data, tracking it down when troubleshooting a system could be non-trivial. It is worth saving it in the kernel logs. # dmesg | grep 'Estimated ratio of average max' [ 14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289 Signed-off-by:
Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
-
Giovanni Gherdovich authored
Frequency invariant accounting calculations need the ratio freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power allocation between cores: AMD EPYC CPUs implement "Core Performance Boost". Three candidates are considered to estimate this value: - maximum non-boost frequency - maximum boost frequency - the mid point between the above two Experimental data on an AMD EPYC Zen2 machine slightly favors the third option, which is applied with this patch. The analysis uses the ondemand cpufreq governor as baseline, and compares it with schedutil in a number of configurations. Using the freq_max value described above offers a moderate advantage in performance and efficiency: sugov-max (freq_max=max_boost) performs the worst on tbench: less throughput and reduced efficiency than the other invariant-schedutil options (see "Data Overview" below). Consider that tbench is generally a problematic case as no schedutil version currently is better than ondemand. sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov's can surpass ondemand with less filesystem latency and slightly increased efficiency. 1. DATA OVERVIEW 2. DETAILED PERFORMANCE TABLES 3. POWER CONSUMPTION TABLE 1. DATA OVERVIEW ================ sugov-noinv : non-invariant schedutil governor sugov-max : invariant schedutil, freq_max=max_boost sugov-mid : invariant schedutil, freq_max=midpoint sugov-P0 : invariant schedutil, freq_max=max_P perfgov : performance governor driver : acpi_cpufreq machine : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket, 128 cores / 256 threads, SATA SSD storage, 250G of memory, XFS filesystem Benchmarks are described in the next section. Tilde (~) means the value is the same as baseline. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ondemand perfgov sugov-noinv sugov-max sugov-mid sugov-P0 better if - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PERFORMANCE RATIOS tbench 1.00 1.44 0.90 0.87 0.93 0.93 higher dbench 1.00 0.91 0.95 0.94 0.94 1.06 lower kernbench 1.00 0.93 ~ ~ ~ 0.97 lower gitsource 1.00 0.66 0.97 0.96 ~ 0.95 lower - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - PERFORMANCE-PER-WATT RATIOS tbench 1.00 1.16 0.84 0.84 0.88 0.85 higher dbench 1.00 1.03 1.02 1.02 1.02 0.93 higher kernbench 1.00 1.05 ~ ~ ~ ~ higher gitsource 1.00 1.46 1.04 1.04 ~ 1.05 higher 2. DETAILED PERFORMANCE TABLES ============================== Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 427.19 +- 0.16% ( ) 778.35 +- 0.10% ( 82.20%) 346.92 +- 0.14% ( -18.79%) Hmean 2 853.82 +- 0.09% ( ) 1536.23 +- 0.03% ( 79.93%) 694.36 +- 0.05% ( -18.68%) Hmean 4 1657.54 +- 0.12% ( ) 2938.18 +- 0.12% ( 77.26%) 1362.81 +- 0.11% ( -17.78%) Hmean 8 3301.87 +- 0.06% ( ) 5679.10 +- 0.04% ( 72.00%) 2693.35 +- 0.04% ( -18.43%) Hmean 16 6139.65 +- 0.05% ( ) 9498.81 +- 0.04% ( 54.71%) 4889.97 +- 0.17% ( -20.35%) Hmean 32 11170.28 +- 0.09% ( ) 17393.25 +- 0.08% ( 55.71%) 9104.55 +- 0.09% ( -18.49%) Hmean 64 19322.97 +- 0.17% ( ) 31573.91 +- 0.08% ( 63.40%) 18552.52 +- 0.40% ( -3.99%) Hmean 128 30383.71 +- 0.11% ( ) 37416.91 +- 0.15% ( 23.15%) 25938.70 +- 0.41% ( -14.63%) Hmean 256 31143.96 +- 0.41% ( ) 30908.76 +- 0.88% ( -0.76%) 29754.32 +- 0.24% ( -4.46%) Hmean 512 30858.49 +- 0.26% ( ) 38524.60 +- 1.19% ( 24.84%) 42080.39 +- 0.56% ( 36.37%) Hmean 1024 39187.37 +- 0.19% ( ) 36213.86 +- 0.26% ( -7.59%) 39555.98 +- 0.12% ( 0.94%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 352.59 +- 1.03% ( -17.46%) 352.08 +- 0.75% ( -17.58%) 352.31 +- 1.48% ( -17.53%) Hmean 2 697.32 +- 0.08% ( -18.33%) 700.16 +- 0.20% ( -18.00%) 696.79 +- 0.06% ( -18.39%) Hmean 4 1369.88 +- 0.04% ( -17.35%) 1369.72 +- 0.07% ( -17.36%) 1365.91 +- 0.05% ( -17.59%) Hmean 8 2696.79 +- 0.04% ( -18.33%) 2711.06 +- 0.04% ( -17.89%) 2715.10 +- 0.61% ( -17.77%) Hmean 16 4725.03 +- 0.03% ( -23.04%) 4875.65 +- 0.02% ( -20.59%) 4953.05 +- 0.28% ( -19.33%) Hmean 32 9231.65 +- 0.10% ( -17.36%) 8704.89 +- 0.27% ( -22.07%) 10562.02 +- 0.36% ( -5.45%) Hmean 64 15364.27 +- 0.19% ( -20.49%) 17786.64 +- 0.15% ( -7.95%) 19665.40 +- 0.22% ( 1.77%) Hmean 128 42100.58 +- 0.13% ( 38.56%) 34946.28 +- 0.13% ( 15.02%) 38635.79 +- 0.06% ( 27.16%) Hmean 256 30660.23 +- 1.08% ( -1.55%) 32307.67 +- 0.54% ( 3.74%) 31153.27 +- 0.12% ( 0.03%) Hmean 512 24604.32 +- 0.14% ( -20.27%) 40408.50 +- 1.10% ( 30.95%) 38800.29 +- 1.23% ( 25.74%) Hmean 1024 35535.47 +- 0.28% ( -9.32%) 41070.38 +- 2.56% ( 4.81%) 31308.29 +- 2.52% ( -20.11%) Benchmark : dbench (filesystem stressor) Varying parameter : number of clients Unit : seconds (lower is better) NOTE-1: This dbench version measures the average latency of a set of filesystem operations, as we found the traditional dbench metric (throughput) to be misleading. NOTE-2: Due to high variability, we partition the original dataset and apply statistical bootrapping (a resampling method). Accuracy is reported in the form of 95% confidence intervals. 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - SubAmean 1 98.79 +- 0.92 ( ) 83.36 +- 0.82 ( 15.62%) 84.82 +- 0.92 ( 14.14%) SubAmean 2 116.00 +- 0.89 ( ) 102.12 +- 0.77 ( 11.96%) 109.63 +- 0.89 ( 5.49%) SubAmean 4 149.90 +- 1.03 ( ) 132.12 +- 0.91 ( 11.86%) 143.90 +- 1.15 ( 4.00%) SubAmean 8 182.41 +- 1.13 ( ) 159.86 +- 0.93 ( 12.36%) 165.82 +- 1.03 ( 9.10%) SubAmean 16 237.83 +- 1.23 ( ) 219.46 +- 1.14 ( 7.72%) 229.28 +- 1.19 ( 3.59%) SubAmean 32 334.34 +- 1.49 ( ) 309.94 +- 1.42 ( 7.30%) 321.19 +- 1.36 ( 3.93%) SubAmean 64 576.61 +- 2.16 ( ) 540.75 +- 2.00 ( 6.22%) 551.27 +- 1.99 ( 4.39%) SubAmean 128 1350.07 +- 4.14 ( ) 1205.47 +- 3.20 ( 10.71%) 1280.26 +- 3.75 ( 5.17%) SubAmean 256 3444.42 +- 7.97 ( ) 3698.00 +- 27.43 ( -7.36%) 3494.14 +- 7.81 ( -1.44%) SubAmean 2048 39457.89 +- 29.01 ( ) 34105.33 +- 41.85 ( 13.57%) 39688.52 +- 36.26 ( -0.58%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - SubAmean 1 85.68 +- 1.04 ( 13.27%) 84.16 +- 0.84 ( 14.81%) 83.99 +- 0.90 ( 14.99%) SubAmean 2 108.42 +- 0.95 ( 6.54%) 109.91 +- 1.39 ( 5.24%) 112.06 +- 0.91 ( 3.39%) SubAmean 4 136.90 +- 1.04 ( 8.67%) 137.59 +- 0.93 ( 8.21%) 136.55 +- 0.95 ( 8.91%) SubAmean 8 163.15 +- 0.96 ( 10.56%) 166.07 +- 1.02 ( 8.96%) 165.81 +- 0.99 ( 9.10%) SubAmean 16 224.86 +- 1.12 ( 5.45%) 223.83 +- 1.06 ( 5.89%) 230.66 +- 1.19 ( 3.01%) SubAmean 32 320.51 +- 1.38 ( 4.13%) 322.85 +- 1.49 ( 3.44%) 321.96 +- 1.46 ( 3.70%) SubAmean 64 553.25 +- 1.93 ( 4.05%) 554.19 +- 2.08 ( 3.89%) 562.26 +- 2.22 ( 2.49%) SubAmean 128 1264.35 +- 3.72 ( 6.35%) 1256.99 +- 3.46 ( 6.89%) 2018.97 +- 18.79 ( -49.55%) SubAmean 256 3466.25 +- 8.25 ( -0.63%) 3450.58 +- 8.44 ( -0.18%) 5032.12 +- 38.74 ( -46.09%) SubAmean 2048 39133.10 +- 45.71 ( 0.82%) 39905.95 +- 34.33 ( -1.14%) 53811.86 +-193.04 ( -36.38%) Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 471.71 +- 26.61% ( ) 409.88 +- 16.99% ( 13.11%) 430.63 +- 0.18% ( 8.71%) Amean 4 211.87 +- 0.58% ( ) 194.03 +- 0.74% ( 8.42%) 215.33 +- 0.64% ( -1.63%) Amean 8 109.79 +- 1.27% ( ) 101.43 +- 1.53% ( 7.61%) 111.05 +- 1.95% ( -1.15%) Amean 16 59.50 +- 1.28% ( ) 55.61 +- 1.35% ( 6.55%) 59.65 +- 1.78% ( -0.24%) Amean 32 34.94 +- 1.22% ( ) 32.36 +- 1.95% ( 7.41%) 35.44 +- 0.63% ( -1.43%) Amean 64 22.58 +- 0.38% ( ) 20.97 +- 1.28% ( 7.11%) 22.41 +- 1.73% ( 0.74%) Amean 128 17.72 +- 0.44% ( ) 16.68 +- 0.32% ( 5.88%) 17.65 +- 0.96% ( 0.37%) Amean 256 16.44 +- 0.53% ( ) 15.76 +- 0.32% ( 4.18%) 16.76 +- 0.60% ( -1.93%) Amean 512 16.54 +- 0.21% ( ) 15.62 +- 0.41% ( 5.53%) 16.84 +- 0.85% ( -1.83%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 421.30 +- 0.24% ( 10.69%) 419.26 +- 0.15% ( 11.12%) 414.38 +- 0.33% ( 12.15%) Amean 4 217.81 +- 5.53% ( -2.80%) 211.63 +- 0.99% ( 0.12%) 208.43 +- 0.47% ( 1.63%) Amean 8 108.80 +- 0.43% ( 0.90%) 108.48 +- 1.44% ( 1.19%) 108.59 +- 3.08% ( 1.09%) Amean 16 58.84 +- 0.74% ( 1.12%) 58.37 +- 0.94% ( 1.91%) 57.78 +- 0.78% ( 2.90%) Amean 32 34.04 +- 2.00% ( 2.59%) 34.28 +- 1.18% ( 1.91%) 33.98 +- 2.21% ( 2.75%) Amean 64 22.22 +- 1.69% ( 1.60%) 22.27 +- 1.60% ( 1.38%) 22.25 +- 1.41% ( 1.47%) Amean 128 17.55 +- 0.24% ( 0.97%) 17.53 +- 0.94% ( 1.04%) 17.49 +- 0.43% ( 1.30%) Amean 256 16.51 +- 0.46% ( -0.40%) 16.48 +- 0.48% ( -0.19%) 16.44 +- 1.21% ( 0.00%) Amean 512 16.50 +- 0.35% ( 0.19%) 16.35 +- 0.42% ( 1.14%) 16.37 +- 0.33% ( 0.99%) Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.9.0-ondemand (BASELINE) 5.9.0-perfgov 5.9.0-sugov-noinv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 1035.76 +- 0.30% ( ) 688.21 +- 0.04% ( 33.56%) 1003.85 +- 0.14% ( 3.08%) 5.9.0-sugov-max 5.9.0-sugov-mid 5.9.0-sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 995.82 +- 0.08% ( 3.86%) 1011.98 +- 0.03% ( 2.30%) 986.87 +- 0.19% ( 4.72%) 3. POWER CONSUMPTION TABLE ========================== Average power consumption (watts). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ondemand perfgov sugov-noinv sugov-max sugov-mid sugov-P0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - tbench4 227.25 281.83 244.17 236.76 241.50 247.99 dbench4 151.97 161.87 157.08 158.10 158.06 153.73 kernbench 162.78 167.22 162.90 164.19 164.65 164.72 gitsource 133.65 139.00 133.04 134.43 134.18 134.32 Signed-off-by:
Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz
-
Nathan Fontenot authored
This is the first pass in creating the ability to calculate the frequency invariance on AMD systems. This approach uses the CPPC highest performance and nominal performance values that range from 0 - 255 instead of a high and base frquency. This is because we do not have the ability on AMD to get a highest frequency value. On AMD systems the highest performance and nominal performance vaues do correspond to the highest and base frequencies for the system so using them should produce an appropriate ratio but some tweaking is likely necessary. Due to CPPC being initialized later in boot than when the frequency invariant calculation is currently made, I had to create a callback from the CPPC init code to do the calculation after we have CPPC data. Special thanks to "kernel test robot <lkp@intel.com>" for reporting that compilation of drivers/acpi/cppc_acpi.c is conditional to CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI. [ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ] Signed-off-by:
Nathan Fontenot <nathan.fontenot@amd.com> Signed-off-by:
Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
-
- Nov 24, 2020
-
-
Peter Zijlstra authored
Get rid of the __call_single_node union and cleanup the API a little to avoid external code relying on the structure layout as much. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Frederic Weisbecker <frederic@kernel.org>
-
- Nov 23, 2020
-
-
Sven Schnelle authored
We need to disable interrupts in load_fpu_regs(). Otherwise an interrupt might come in after the registers are loaded, but before CIF_FPU is cleared in load_fpu_regs(). When the interrupt returns, CIF_FPU will be cleared and the registers will never be restored. The entry.S code usually saves the interrupt state in __SF_EMPTY on the stack when disabling/restoring interrupts. sie64a however saves the pointer to the sie control block in __SF_SIE_CONTROL, which references the same location. This is non-obvious to the reader. To avoid thrashing the sie control block pointer in load_fpu_regs(), move the __SIE_* offsets eight bytes after __SF_EMPTY on the stack. Cc: <stable@vger.kernel.org> # 5.8 Fixes: 0b0ed657 ("s390: remove critical section cleanup from entry.S") Reported-by:
Pierre Morel <pmorel@linux.ibm.com> Signed-off-by:
Sven Schnelle <svens@linux.ibm.com> Acked-by:
Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by:
Heiko Carstens <hca@linux.ibm.com> Signed-off-by:
Heiko Carstens <hca@linux.ibm.com>
-
- Nov 22, 2020
-
-
Dan Williams authored
The core-mm has a default __weak implementation of phys_to_target_node() to mirror the weak definition of memory_add_physaddr_to_nid(). That symbol is exported for modules. However, while the export in mm/memory_hotplug.c exported the symbol in the configuration cases of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=y ...and: CONFIG_NUMA_KEEP_MEMINFO=n CONFIG_MEMORY_HOTPLUG=y ...it failed to export the symbol in the case of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=n Not only is that broken, but Christoph points out that the kernel should not be exporting any __weak symbol, which means that memory_add_physaddr_to_nid() example that phys_to_target_node() copied is broken too. Rework the definition of phys_to_target_node() and memory_add_physaddr_to_nid() to not require weak symbols. Move to the common arch override design-pattern of an asm header defining a symbol to replace the default implementation. The only common header that all memory_add_physaddr_to_nid() producing architectures implement is asm/sparsemem.h. In fact, powerpc already defines its memory_add_physaddr_to_nid() helper in sparsemem.h. Double-down on that observation and define phys_to_target_node() where necessary in asm/sparsemem.h. An alternate consideration that was discarded was to put this override in asm/numa.h, but that entangles with the definition of MAX_NUMNODES relative to the inclusion of linux/nodemask.h, and requires powerpc to grow a new header. The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid now that the symbol is properly exported / stubbed in all combinations of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG. [dan.j.williams@intel.com: v4] Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning] Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: a035b6bf ("mm/memory_hotplug: introduce default phys_to_target_node() implementation") Reported-by:
Randy Dunlap <rdunlap@infradead.org> Reported-by:
Thomas Gleixner <tglx@linutronix.de> Reported-by:
kernel test robot <lkp@intel.com> Reported-by:
Christoph Hellwig <hch@infradead.org> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Tested-by:
Randy Dunlap <rdunlap@infradead.org> Tested-by:
Thomas Gleixner <tglx@linutronix.de> Reviewed-by:
Thomas Gleixner <tglx@linutronix.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Nov 19, 2020
-
-
Daniel Axtens authored
pseries|pnv_setup_rfi_flush already does the count cache flush setup, and we just added entry and uaccess flushes. So the name is not very accurate any more. In both platforms we then also immediately setup the STF flush. Rename them to _setup_security_mitigations and fold the STF flush in. Signed-off-by:
Daniel Axtens <dja@axtens.net> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
In kup.h we currently include kup-radix.h for all 64-bit builds, which includes Book3S and Book3E. The latter doesn't make sense, Book3E never uses the Radix MMU. This has worked up until now, but almost by accident, and the recent uaccess flush changes introduced a build breakage on Book3E because of the bad structure of the code. So disentangle things so that we only use kup-radix.h for Book3S. This requires some more stubs in kup.h and fixing an include in syscall_64.c. Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
IBM Power9 processors can speculatively operate on data in the L1 cache before it has been completely validated, via a way-prediction mechanism. It is not possible for an attacker to determine the contents of impermissible memory using this method, since these systems implement a combination of hardware and software security measures to prevent scenarios where protected data could be leaked. However these measures don't address the scenario where an attacker induces the operating system to speculatively execute instructions using data that the attacker controls. This can be used for example to speculatively bypass "kernel user access prevention" techniques, as discovered by Anthony Steinhauser of Google's Safeside Project. This is not an attack by itself, but there is a possibility it could be used in conjunction with side-channels or other weaknesses in the privileged code to construct an attack. This issue can be mitigated by flushing the L1 cache between privilege boundaries of concern. This patch flushes the L1 cache after user accesses. This is part of the fix for CVE-2020-4788. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Daniel Axtens <dja@axtens.net> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
IBM Power9 processors can speculatively operate on data in the L1 cache before it has been completely validated, via a way-prediction mechanism. It is not possible for an attacker to determine the contents of impermissible memory using this method, since these systems implement a combination of hardware and software security measures to prevent scenarios where protected data could be leaked. However these measures don't address the scenario where an attacker induces the operating system to speculatively execute instructions using data that the attacker controls. This can be used for example to speculatively bypass "kernel user access prevention" techniques, as discovered by Anthony Steinhauser of Google's Safeside Project. This is not an attack by itself, but there is a possibility it could be used in conjunction with side-channels or other weaknesses in the privileged code to construct an attack. This issue can be mitigated by flushing the L1 cache between privilege boundaries of concern. This patch flushes the L1 cache on kernel entry. This is part of the fix for CVE-2020-4788. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Daniel Axtens <dja@axtens.net> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Ionela Voinescu authored
Task scheduler behavior depends on frequency invariance (FI) support and the resulting invariant load tracking signals. For example, in order to make accurate predictions across CPUs for all performance states, Energy Aware Scheduling (EAS) needs frequency-invariant load tracking signals and therefore it has a direct dependency on FI. This dependency is known, but EAS enablement is not yet conditioned on the presence of FI during the built of the scheduling domain hierarchy. Before this is done, the following must be considered: while arch_scale_freq_invariant() will see changes in FI support and could be used to condition the use of EAS, it could return different values during system initialisation. For arm64, such a scenario will happen for a system that does not support cpufreq driven FI, but does support counter-driven FI. For such a system, arch_scale_freq_invariant() will return false if called before counter based FI initialisation, but change its status to true after it. If EAS becomes explicitly dependent on FI this would affect the task scheduler behavior which builds its scheduling domain hierarchy well before the late counter-based FI init. During that process, EAS would be disabled due to its dependency on FI. Two points of future early calls to arch_scale_freq_invariant() which would determine EAS enablement are: - (1) drivers/base/arch_topology.c:126 <<update_topology_flags_workfn>> rebuild_sched_domains(); This will happen after CPU capacity initialisation. - (2) kernel/sched/cpufreq_schedutil.c:917 <<rebuild_sd_workfn>> rebuild_sched_domains_energy(); -->rebuild_sched_domains(); This will happen during sched_cpufreq_governor_change() for the schedutil cpufreq governor. Therefore, before enforcing the presence of FI support for the use of EAS, ensure the following: if there is a change in FI support status after counter init, use the existing rebuild_sched_domains_energy() function to trigger a rebuild of the scheduling and performance domains that in turn will determine the enablement of EAS. Signed-off-by:
Ionela Voinescu <ionela.voinescu@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Catalin Marinas <catalin.marinas@arm.com> Link: https://lkml.kernel.org/r/20201027180713.7642-3-ionela.voinescu@arm.com
-
- Nov 18, 2020
-
-
Zhenzhong Duan authored
"intel_iommu=off" command line is used to disable iommu but iommu is force enabled in a tboot system for security reason. However for better performance on high speed network device, a new option "intel_iommu=tboot_noforce" is introduced to disable the force on. By default kernel should panic if iommu init fail in tboot for security reason, but it's unnecessory if we use "intel_iommu=tboot_noforce,off". Fix the code setting force_on and move intel_iommu_tboot_noforce from tboot code to intel iommu code. Fixes: 7304e8f2 ("iommu/vt-d: Correctly disable Intel IOMMU force on") Signed-off-by:
Zhenzhong Duan <zhenzhong.duan@gmail.com> Tested-by:
Lukasz Hawrylko <lukasz.hawrylko@linux.intel.com> Acked-by:
Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20201110071908.3133-1-zhenzhong.duan@gmail.com Signed-off-by:
Will Deacon <will@kernel.org>
-
Thomas Gleixner authored
sysrq-t ends up invoking show_opcodes() for each task which tries to access the user space code of other processes, which is obviously bogus. It either manages to dump where the foreign task's regs->ip points to in a valid mapping of the current task or triggers a pagefault and prints "Code: Bad RIP value.". Both is just wrong. Add a safeguard in copy_code() and check whether the @regs pointer matches currents pt_regs. If not, do not even try to access it. While at it, add commentary why using copy_from_user_nmi() is safe in copy_code() even if the function name suggests otherwise. Reported-by:
Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Borislav Petkov <bp@suse.de> Reviewed-by:
Borislav Petkov <bp@suse.de> Acked-by:
Oleg Nesterov <oleg@redhat.com> Tested-by:
Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201117202753.667274723@linutronix.de
-
Vineet Gupta authored
This is a non-functional change, if anything a better fall-back handling. Signed-off-by:
Vineet Gupta <vgupta@synopsys.com>
-
Vineet Gupta authored
To start stack unwinding (SP, PC and BLINK) are needed. When the explicit execution context (pt_regs etc) is not available, unwinder assumes the task is sleeping (in __switch_to()) and fetches SP and BLINK from kernel mode stack. But this assumption is not true, specially in a SMP system, when top runs on 1 core, there may be active running processes on all cores. So when unwinding non courrent tasks, ensure they are NOT running. And while at it, handle the self unwinding case explicitly. This came out of investigation of a customer reported hang with rcutorture+top Link: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/31 Signed-off-by:
Vineet Gupta <vgupta@synopsys.com>
-
Flavio Suligoi authored
Signed-off-by:
Flavio Suligoi <f.suligoi@asem.it> Signed-off-by:
Vineet Gupta <vgupta@synopsys.com>
-
Gustavo Pimentel authored
The 1-bit shift rotation to the left on x variable located on 4 last if statement can be removed because the computed value is will not be used afront. Signed-off-by:
Gustavo Pimentel <gustavo.pimentel@synopsys.com> Signed-off-by:
Vineet Gupta <vgupta@synopsys.com>
-
- Nov 17, 2020
-
-
Laurent Pinchart authored
When adding __user annotations in commit 2adf5352, the strncpy_from_user() function declaration for the CONFIG_GENERIC_STRNCPY_FROM_USER case was missed. Fix it. Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Laurent Pinchart <laurent.pinchart@ideasonboard.com> Message-Id: <20200831210937.17938-1-laurent.pinchart@ideasonboard.com> Signed-off-by:
Max Filippov <jcmvbkbc@gmail.com>
-
Sami Tolvanen authored
This change switches rapl to use PMU_FORMAT_ATTR, and fixes two other macros to use device_attribute instead of kobj_attribute to avoid callback type mismatches that trip indirect call checking with Clang's Control-Flow Integrity (CFI). Reported-by:
Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by:
Sami Tolvanen <samitolvanen@google.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20201113183126.1239404-1-samitolvanen@google.com
-
Zhang Qilong authored
If the clk_register fails, we should free h before function returns to prevent memleak. Fixes: 47440229 ("MIPS: Alchemy: clock framework integration of onchip clocks") Reported-by:
Hulk Robot <hulkci@huawei.com> Signed-off-by:
Zhang Qilong <zhangqilong3@huawei.com> Signed-off-by:
Thomas Bogendoerfer <tsbogend@alpha.franken.de>
-
Chen Yu authored
Currently, scan_microcode() leverages microcode_matches() to check if the microcode matches the CPU by comparing the family and model. However, the processor stepping and flags of the microcode signature should also be considered when saving a microcode patch for early update. Use find_matching_signature() in scan_microcode() and get rid of the now-unused microcode_matches() which is a good cleanup in itself. Complete the verification of the patch being saved for early loading in save_microcode_patch() directly. This needs to be done there too because save_mc_for_early() will call save_microcode_patch() too. The second reason why this needs to be done is because the loader still tries to support, at least hypothetically, mixed-steppings systems and thus adds all patches to the cache that belong to the same CPU model albeit with different steppings. For example: microcode: CPU: sig=0x906ec, pf=0x2, rev=0xd6 microcode: mc_saved[0]: sig=0x906e9, pf=0x2a, rev=0xd6, total size=0x19400, date = 2020-04-23 microcode: mc_saved[1]: sig=0x906ea, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27 microcode: mc_saved[2]: sig=0x906eb, pf=0x2, rev=0xd6, total size=0x19400, date = 2020-04-23 microcode: mc_saved[3]: sig=0x906ec, pf=0x22, rev=0xd6, total size=0x19000, date = 2020-04-27 microcode: mc_saved[4]: sig=0x906ed, pf=0x22, rev=0xd6, total size=0x19400, date = 2020-04-23 The patch which is being saved for early loading, however, can only be the one which fits the CPU this runs on so do the signature verification before saving. [ bp: Do signature verification in save_microcode_patch() and rewrite commit message. ] Fixes: ec400dde ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU") Signed-off-by:
Chen Yu <yu.c.chen@intel.com> Signed-off-by:
Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=208535 Link: https://lkml.kernel.org/r/20201113015923.13960-1-yu.c.chen@intel.com
-
Thomas Bogendoerfer authored
The loop over all memblocks works with PFNs and not physical addresses, so we need for_each_mem_pfn_range(). Fixes: b10d6bca ("arch, drivers: replace for_each_membock() with for_each_mem_range()") Signed-off-by:
Thomas Bogendoerfer <tsbogend@alpha.franken.de> Reviewed-by:
Mike Rapoport <rppt@linux.ibm.com> Reviewed-by:
Serge Semin <fancer.lancer@gmail.com>
-
- Nov 16, 2020
-
-
Max Filippov authored
Although cache alias management calls set up and tear down TLB entries and fast_second_level_miss is able to restore TLB entry should it be evicted they absolutely cannot preempt each other because they use the same TLBTEMP area for different purposes. Disable preemption around all cache alias management calls to enforce that. Cc: stable@vger.kernel.org Signed-off-by:
Max Filippov <jcmvbkbc@gmail.com>
-
Max Filippov authored
fast_second_level_miss handler for the TLBTEMP area has an assumption that page table directory entry for the TLBTEMP address range is 0. For it to be true the TLBTEMP area must be aligned to 4MB boundary and not share its 4MB region with anything that may use a page table. This is not true currently: TLBTEMP shares space with vmalloc space which results in the following kinds of runtime errors when fast_second_level_miss loads page table directory entry for the vmalloc space instead of fixing up the TLBTEMP area: Unable to handle kernel paging request at virtual address c7ff0e00 pc = d0009275, ra = 90009478 Oops: sig: 9 [#1] PREEMPT CPU: 1 PID: 61 Comm: kworker/u9:2 Not tainted 5.10.0-rc3-next-20201110-00007-g1fe4962fa983-dirty #58 Workqueue: xprtiod xs_stream_data_receive_workfn a00: 90009478 d11e1dc0 c7ff0e00 00000020 c7ff0000 00000001 7f8b8107 00000000 a08: 900c5992 d11e1d90 d0cc88b8 5506e97c 00000000 5506e97c d06c8074 d11e1d90 pc: d0009275, ps: 00060310, depc: 00000014, excvaddr: c7ff0e00 lbeg: d0009275, lend: d0009287 lcount: 00000003, sar: 00000010 Call Trace: xs_stream_data_receive_workfn+0x43c/0x770 process_one_work+0x1a1/0x324 worker_thread+0x1cc/0x3c0 kthread+0x10d/0x124 ret_from_kernel_thread+0xc/0x18 Cc: stable@vger.kernel.org Signed-off-by:
Max Filippov <jcmvbkbc@gmail.com>
-
- Nov 15, 2020
-
-
Paolo Bonzini authored
In some cases where shadow paging is in use, the root page will be either mmu->pae_root or vcpu->arch.mmu->lm_root. Then it will not have an associated struct kvm_mmu_page, because it is allocated with alloc_page instead of kvm_mmu_alloc_page. Just return false quickly from is_tdp_mmu_root if the TDP MMU is not in use, which also includes the case where shadow paging is enabled. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Nov 13, 2020
-
-
Krzysztof Kozlowski authored
This reverts commit eaf2d2f6. The commit eaf2d2f6 ("ARM: dts: exynos: add input clock to CMU in Exynos4412 Odroid") breaks probing of usb3503 USB hub on Odroid U3. It changes the order of clock drivers probe: the clkout (Exynos PMU) driver is probed before the main clk-exynos4 driver. The clkout driver on Exynos4412 depends on clk-exynos4 but it does not support deferred probe, therefore this dependency and changed probe order causes probe failure. The usb3503 USB hub on Odroid U3 on the other hand requires clkout clock. This can be seen in logs: [ 5.007442] usb3503 0-0008: unable to request refclk (-517) Reported-by:
Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by:
Krzysztof Kozlowski <krzk@kernel.org> Link: https://lore.kernel.org/r/20200921174818.15525-1-krzk@kernel.org Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
Babu Moger authored
For AMD SEV guests, update the cr3_lm_rsvd_bits to mask the memory encryption bit in reserved bits. Signed-off-by:
Babu Moger <babu.moger@amd.com> Message-Id: <160521948301.32054.5783800787423231162.stgit@bmoger-ubuntu> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Babu Moger authored
SEV guests fail to boot on a system that supports the PCID feature. While emulating the RSM instruction, KVM reads the guest CR3 and calls kvm_set_cr3(). If the vCPU is in the long mode, kvm_set_cr3() does a sanity check for the CR3 value. In this case, it validates whether the value has any reserved bits set. The reserved bit range is 63:cpuid_maxphysaddr(). When AMD memory encryption is enabled, the memory encryption bit is set in the CR3 value. The memory encryption bit may fall within the KVM reserved bit range, causing the KVM emulation failure. Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will cache the reserved bits in the CR3 value. This will be initialized to rsvd_bits(cpuid_maxphyaddr(vcpu), 63). If the architecture has any special bits(like AMD SEV encryption bit) that needs to be masked from the reserved bits, should be cleared in vendor specific kvm_x86_ops.vcpu_after_set_cpuid handler. Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3") Signed-off-by:
Babu Moger <babu.moger@amd.com> Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Edmondson authored
The instruction emulator ignores clflush instructions, yet fails to support clflushopt. Treat both similarly. Fixes: 13e457e0 ("KVM: x86: Emulator does not decode clflush well") Signed-off-by:
David Edmondson <david.edmondson@oracle.com> Message-Id: <20201103120400.240882-1-david.edmondson@oracle.com> Reviewed-by:
Joao Martins <joao.m.martins@oracle.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Konrad Dybcio authored
QCOM KRYO2XX Silver cores are Cortex-A53 based and are susceptible to the 845719 erratum. Add them to the lookup list to apply the erratum. Signed-off-by:
Konrad Dybcio <konrad.dybcio@somainline.org> Link: https://lore.kernel.org/r/20201104232218.198800-5-konrad.dybcio@somainline.org Signed-off-by:
Will Deacon <will@kernel.org>
-
Konrad Dybcio authored
KRYO2XX silver (LITTLE) CPUs are based on Cortex-A53 and they are not affected by spectre-v2. Signed-off-by:
Konrad Dybcio <konrad.dybcio@somainline.org> Link: https://lore.kernel.org/r/20201104232218.198800-4-konrad.dybcio@somainline.org Signed-off-by:
Will Deacon <will@kernel.org>
-
Konrad Dybcio authored
QCOM KRYO2XX gold (big) silver (LITTLE) CPU cores are based on Cortex-A73 and Cortex-A53 respectively and are meltdown safe, hence add them to kpti_safe_list[]. Signed-off-by:
Konrad Dybcio <konrad.dybcio@somainline.org> Link: https://lore.kernel.org/r/20201104232218.198800-3-konrad.dybcio@somainline.org Signed-off-by:
Will Deacon <will@kernel.org>
-
Konrad Dybcio authored
Add MIDR value for KRYO2XX gold (big) and silver (LITTLE) CPU cores which are used in Qualcomm Technologies, Inc. SoCs. This will be used to identify and apply errata which are applicable for these CPU cores. Signed-off-by:
Konrad Dybcio <konrad.dybcio@somainline.org> Link: https://lore.kernel.org/r/20201104232218.198800-2-konrad.dybcio@somainline.org Signed-off-by:
Will Deacon <will@kernel.org>
-
Anshuman Khandual authored
During memory hotplug process, the linear mapping should not be created for a given memory range if that would fall outside the maximum allowed linear range. Else it might cause memory corruption in the kernel virtual space. Maximum linear mapping region is [PAGE_OFFSET..(PAGE_END -1)] accommodating both its ends but excluding PAGE_END. Max physical range that can be mapped inside this linear mapping range, must also be derived from its end points. This ensures that arch_add_memory() validates memory hot add range for its potential linear mapping requirements, before creating it with __create_pgd_mapping(). Fixes: 4ab21506 ("arm64: Add memory hotplug support") Signed-off-by:
Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by:
Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Link: https://lore.kernel.org/r/1605252614-761-1-git-send-email-anshuman.khandual@arm.com Signed-off-by:
Will Deacon <will@kernel.org>
-
- Nov 12, 2020
-
-
Mike Travis authored
A test shows that the output contains a space: # cat /proc/sgi_uv/archtype NSGI4 U/UVX Remove that embedded space by copying the "trimmed" buffer instead of the untrimmed input character list. Use sizeof to remove size dependency on copy out length. Increase output buffer size by one character just in case BIOS sends an 8 character string for archtype. Fixes: 1e61f5a9 ("Add and decode Arch Type in UVsystab") Signed-off-by:
Mike Travis <mike.travis@hpe.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Reviewed-by:
Steve Wahl <steve.wahl@hpe.com> Link: https://lore.kernel.org/r/20201111010418.82133-1-mike.travis@hpe.com
-
Marc Zyngier authored
As the kernel never sets HCR_EL2.EnSCXT, accesses to SCXTNUM_ELx will trap to EL2. Let's handle that as gracefully as possible by injecting an UNDEF exception into the guest. This is consistent with the guest's view of ID_AA64PFR0_EL1.CSV2 being at most 1. Signed-off-by:
Marc Zyngier <maz@kernel.org> Acked-by:
Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20201110141308.451654-4-maz@kernel.org
-
Marc Zyngier authored
A large number of system register trap handlers only inject an UNDEF exeption, and yet each class of sysreg seems to provide its own, identical function. Let's unify them all, saving us introducing yet another one later. Signed-off-by:
Marc Zyngier <maz@kernel.org> Acked-by:
Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20201110141308.451654-3-maz@kernel.org
-
Marc Zyngier authored
We now expose ID_AA64PFR0_EL1.CSV2=1 to guests running on hosts that are immune to Spectre-v2, but that don't have this field set, most likely because they predate the specification. However, this prevents the migration of guests that have started on a host the doesn't fake this CSV2 setting to one that does, as KVM rejects the write to ID_AA64PFR0_EL2 on the grounds that it isn't what is already there. In order to fix this, allow userspace to set this field as long as this doesn't result in a promising more than what is already there (setting CSV2 to 0 is acceptable, but setting it to 1 when it is already set to 0 isn't). Fixes: e1026237 ("KVM: arm64: Set CSV2 for guests on hardware unaffected by Spectre-v2") Reported-by:
Peng Liang <liangpeng10@huawei.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Acked-by:
Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20201110141308.451654-2-maz@kernel.org
-
Thomas Richter authored
This file is installed by the s390 CPU Measurement sampling facility device driver to export supported minimum and maximum sample buffer sizes. This file is read by lscpumf tool to display the details of the device driver capabilities. The lscpumf tool might be invoked by a non-root user. In this case it does not print anything because the file contents can not be read. Fix this by allowing read access for all users. Reading the file contents is ok, changing the file contents is left to the root user only. For further reference and details see: [1] https://github.com/ibm-s390-tools/s390-tools/issues/97 Fixes: 69f239ed ("s390/cpum_sf: Dynamically extend the sampling buffer if overflows occur") Cc: <stable@vger.kernel.org> # 3.14 Signed-off-by:
Thomas Richter <tmricht@linux.ibm.com> Acked-by:
Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by:
Heiko Carstens <hca@linux.ibm.com>
-
Heiko Carstens authored
Signed-off-by:
Heiko Carstens <hca@linux.ibm.com>
-