  1. Dec 11, 2020
    • x86: Print ratio freq_max/freq_base used in frequency invariance calculations · 3149cd55
      Giovanni Gherdovich authored
      
      The value freq_max/freq_base is a fundamental component of frequency
      invariance calculations. It may come from a variety of sources such as MSRs
      or ACPI data, so tracking it down when troubleshooting a system can be
      non-trivial. It is worth saving it in the kernel logs.
      
       # dmesg | grep 'Estimated ratio of average max'
       [   14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289
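
      A minimal sketch of the print itself, assuming the ratio is kept in the
      existing arch_turbo_freq_ratio variable, scaled by SCHED_CAPACITY_SCALE
      (1024), as in the rest of the x86 frequency invariance code:

        pr_info("Estimated ratio of average max frequency by base frequency (times %d): %llu\n",
                SCHED_CAPACITY_SCALE, arch_turbo_freq_ratio);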
      
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
    • x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC · 976df7e5
      Giovanni Gherdovich authored
      
      Frequency invariant accounting calculations need the ratio
      freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power
      allocation between cores: AMD EPYC CPUs implement "Core Performance Boost".
      Three candidates are considered to estimate this value:
      
      - maximum non-boost frequency
      - maximum boost frequency
      - the midpoint between the above two
      
      Experimental data on an AMD EPYC Zen2 machine slightly favors the third
      option, which is applied with this patch.
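
      In fixed point, with ratios scaled by SCHED_CAPACITY_SCALE (1024), picking
      the midpoint amounts to averaging the boost ratio with 1.0. A sketch of the
      idea (variable names are illustrative, not the actual patch):

        /* ratio of max_boost to max_P, scaled by SCHED_CAPACITY_SCALE */
        boost_ratio = div_u64(max_boost * SCHED_CAPACITY_SCALE, max_P);
        /* freq_max = midpoint of max_boost and max_P: average the ratio with 1.0 */
        max_freq_ratio = (boost_ratio + SCHED_CAPACITY_SCALE) >> 1;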
      
      The analysis uses the ondemand cpufreq governor as baseline, and compares
      it with schedutil in a number of configurations. Using the freq_max value
      described above offers a moderate advantage in performance and efficiency:
      
      sugov-max (freq_max=max_boost) performs the worst on tbench: lower
      throughput and reduced efficiency compared with the other invariant-schedutil
      options (see "Data Overview" below). Note that tbench is generally a
      problematic case, as no schedutil variant currently beats ondemand.
      
      sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov
      variants can surpass ondemand with lower filesystem latency and slightly
      increased efficiency.
      
      1. DATA OVERVIEW
      2. DETAILED PERFORMANCE TABLES
      3. POWER CONSUMPTION TABLE
      
      1. DATA OVERVIEW
      ================
      
      sugov-noinv : non-invariant schedutil governor
      sugov-max   : invariant schedutil, freq_max=max_boost
      sugov-mid   : invariant schedutil, freq_max=midpoint
      sugov-P0    : invariant schedutil, freq_max=max_P
      perfgov     : performance governor
      
      driver      : acpi_cpufreq
      machine     : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket,
                    128 cores / 256 threads, SATA SSD storage, 250G of memory,
                    XFS filesystem
      
      Benchmarks are described in the next section.
      Tilde (~) means the value is the same as baseline.
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                  ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0  better if
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                              PERFORMANCE RATIOS
      tbench        1.00       1.44       0.90       0.87       0.93       0.93      higher
      dbench        1.00       0.91       0.95       0.94       0.94       1.06      lower
      kernbench     1.00       0.93       ~          ~          ~          0.97      lower
      gitsource     1.00       0.66       0.97       0.96       ~          0.95      lower
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                          PERFORMANCE-PER-WATT RATIOS
      tbench        1.00       1.16       0.84       0.84       0.88       0.85      higher
      dbench        1.00       1.03       1.02       1.02       1.02       0.93      higher
      kernbench     1.00       1.05       ~          ~          ~          ~         higher
      gitsource     1.00       1.46       1.04       1.04       ~          1.05      higher
      
      2. DETAILED PERFORMANCE TABLES
      ==============================
      
      Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
      Varying parameter  : number of clients
      Unit               : MB/sec (higher is better)
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        427.19  +- 0.16% (        )     778.35  +- 0.10% (  82.20%)     346.92  +- 0.14% ( -18.79%)
      Hmean  2        853.82  +- 0.09% (        )    1536.23  +- 0.03% (  79.93%)     694.36  +- 0.05% ( -18.68%)
      Hmean  4       1657.54  +- 0.12% (        )    2938.18  +- 0.12% (  77.26%)    1362.81  +- 0.11% ( -17.78%)
      Hmean  8       3301.87  +- 0.06% (        )    5679.10  +- 0.04% (  72.00%)    2693.35  +- 0.04% ( -18.43%)
      Hmean  16      6139.65  +- 0.05% (        )    9498.81  +- 0.04% (  54.71%)    4889.97  +- 0.17% ( -20.35%)
      Hmean  32     11170.28  +- 0.09% (        )   17393.25  +- 0.08% (  55.71%)    9104.55  +- 0.09% ( -18.49%)
      Hmean  64     19322.97  +- 0.17% (        )   31573.91  +- 0.08% (  63.40%)   18552.52  +- 0.40% (  -3.99%)
      Hmean  128    30383.71  +- 0.11% (        )   37416.91  +- 0.15% (  23.15%)   25938.70  +- 0.41% ( -14.63%)
      Hmean  256    31143.96  +- 0.41% (        )   30908.76  +- 0.88% (  -0.76%)   29754.32  +- 0.24% (  -4.46%)
      Hmean  512    30858.49  +- 0.26% (        )   38524.60  +- 1.19% (  24.84%)   42080.39  +- 0.56% (  36.37%)
      Hmean  1024   39187.37  +- 0.19% (        )   36213.86  +- 0.26% (  -7.59%)   39555.98  +- 0.12% (   0.94%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        352.59  +- 1.03% ( -17.46%)     352.08  +- 0.75% ( -17.58%)     352.31  +- 1.48% ( -17.53%)
      Hmean  2        697.32  +- 0.08% ( -18.33%)     700.16  +- 0.20% ( -18.00%)     696.79  +- 0.06% ( -18.39%)
      Hmean  4       1369.88  +- 0.04% ( -17.35%)    1369.72  +- 0.07% ( -17.36%)    1365.91  +- 0.05% ( -17.59%)
      Hmean  8       2696.79  +- 0.04% ( -18.33%)    2711.06  +- 0.04% ( -17.89%)    2715.10  +- 0.61% ( -17.77%)
      Hmean  16      4725.03  +- 0.03% ( -23.04%)    4875.65  +- 0.02% ( -20.59%)    4953.05  +- 0.28% ( -19.33%)
      Hmean  32      9231.65  +- 0.10% ( -17.36%)    8704.89  +- 0.27% ( -22.07%)   10562.02  +- 0.36% (  -5.45%)
      Hmean  64     15364.27  +- 0.19% ( -20.49%)   17786.64  +- 0.15% (  -7.95%)   19665.40  +- 0.22% (   1.77%)
      Hmean  128    42100.58  +- 0.13% (  38.56%)   34946.28  +- 0.13% (  15.02%)   38635.79  +- 0.06% (  27.16%)
      Hmean  256    30660.23  +- 1.08% (  -1.55%)   32307.67  +- 0.54% (   3.74%)   31153.27  +- 0.12% (   0.03%)
      Hmean  512    24604.32  +- 0.14% ( -20.27%)   40408.50  +- 1.10% (  30.95%)   38800.29  +- 1.23% (  25.74%)
      Hmean  1024   35535.47  +- 0.28% (  -9.32%)   41070.38  +- 2.56% (   4.81%)   31308.29  +- 2.52% ( -20.11%)
      
      Benchmark          : dbench (filesystem stressor)
      Varying parameter  : number of clients
      Unit               : seconds (lower is better)
      
      NOTE-1: This dbench version measures the average latency of a set of filesystem
              operations, as we found the traditional dbench metric (throughput) to be
              misleading.
      NOTE-2: Due to high variability, we partition the original dataset and apply
              statistical bootstrapping (a resampling method). Accuracy is reported in
              the form of 95% confidence intervals.
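
      The bootstrap step can be pictured with a minimal, self-contained sketch
      (illustrative only, not the scripts actually used for this analysis):

        /* Bootstrap a 95% confidence interval for the mean of a latency sample
         * by resampling it with replacement many times. */
        #include <stdio.h>
        #include <stdlib.h>

        static int cmp_double(const void *a, const void *b)
        {
                double x = *(const double *)a, y = *(const double *)b;
                return (x > y) - (x < y);
        }

        int main(void)
        {
                double sample[] = { 98.7, 101.2, 99.5, 97.8, 102.4, 100.1, 98.9, 101.7 };
                int n = sizeof(sample) / sizeof(sample[0]);
                enum { NBOOT = 10000 };
                static double means[NBOOT];

                srand(42);
                for (int b = 0; b < NBOOT; b++) {
                        double sum = 0.0;
                        for (int i = 0; i < n; i++)
                                sum += sample[rand() % n];  /* resample with replacement */
                        means[b] = sum / n;
                }
                qsort(means, NBOOT, sizeof(double), cmp_double);
                /* the 2.5th and 97.5th percentiles bound the 95% CI */
                printf("mean CI95: [%.2f, %.2f]\n",
                       means[NBOOT / 40], means[NBOOT - 1 - NBOOT / 40]);
                return 0;
        }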
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      SubAmean  1         98.79  +- 0.92 (        )      83.36  +- 0.82 (  15.62%)      84.82  +- 0.92 (  14.14%)
      SubAmean  2        116.00  +- 0.89 (        )     102.12  +- 0.77 (  11.96%)     109.63  +- 0.89 (   5.49%)
      SubAmean  4        149.90  +- 1.03 (        )     132.12  +- 0.91 (  11.86%)     143.90  +- 1.15 (   4.00%)
      SubAmean  8        182.41  +- 1.13 (        )     159.86  +- 0.93 (  12.36%)     165.82  +- 1.03 (   9.10%)
      SubAmean  16       237.83  +- 1.23 (        )     219.46  +- 1.14 (   7.72%)     229.28  +- 1.19 (   3.59%)
      SubAmean  32       334.34  +- 1.49 (        )     309.94  +- 1.42 (   7.30%)     321.19  +- 1.36 (   3.93%)
      SubAmean  64       576.61  +- 2.16 (        )     540.75  +- 2.00 (   6.22%)     551.27  +- 1.99 (   4.39%)
      SubAmean  128     1350.07  +- 4.14 (        )    1205.47  +- 3.20 (  10.71%)    1280.26  +- 3.75 (   5.17%)
      SubAmean  256     3444.42  +- 7.97 (        )    3698.00 +- 27.43 (  -7.36%)    3494.14  +- 7.81 (  -1.44%)
      SubAmean  2048   39457.89 +- 29.01 (        )   34105.33 +- 41.85 (  13.57%)   39688.52 +- 36.26 (  -0.58%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      SubAmean  1         85.68  +- 1.04 (  13.27%)      84.16  +- 0.84 (  14.81%)      83.99  +- 0.90 (  14.99%)
      SubAmean  2        108.42  +- 0.95 (   6.54%)     109.91  +- 1.39 (   5.24%)     112.06  +- 0.91 (   3.39%)
      SubAmean  4        136.90  +- 1.04 (   8.67%)     137.59  +- 0.93 (   8.21%)     136.55  +- 0.95 (   8.91%)
      SubAmean  8        163.15  +- 0.96 (  10.56%)     166.07  +- 1.02 (   8.96%)     165.81  +- 0.99 (   9.10%)
      SubAmean  16       224.86  +- 1.12 (   5.45%)     223.83  +- 1.06 (   5.89%)     230.66  +- 1.19 (   3.01%)
      SubAmean  32       320.51  +- 1.38 (   4.13%)     322.85  +- 1.49 (   3.44%)     321.96  +- 1.46 (   3.70%)
      SubAmean  64       553.25  +- 1.93 (   4.05%)     554.19  +- 2.08 (   3.89%)     562.26  +- 2.22 (   2.49%)
      SubAmean  128     1264.35  +- 3.72 (   6.35%)    1256.99  +- 3.46 (   6.89%)    2018.97 +- 18.79 ( -49.55%)
      SubAmean  256     3466.25  +- 8.25 (  -0.63%)    3450.58  +- 8.44 (  -0.18%)    5032.12 +- 38.74 ( -46.09%)
      SubAmean  2048   39133.10 +- 45.71 (   0.82%)   39905.95 +- 34.33 (  -1.14%)   53811.86 +-193.04 ( -36.38%)
      
      Benchmark          : kernbench (kernel compilation)
      Varying parameter  : number of jobs
      Unit               : seconds (lower is better)
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        471.71 +- 26.61% (        )     409.88 +- 16.99% (  13.11%)     430.63  +- 0.18% (   8.71%)
      Amean  4        211.87  +- 0.58% (        )     194.03  +- 0.74% (   8.42%)     215.33  +- 0.64% (  -1.63%)
      Amean  8        109.79  +- 1.27% (        )     101.43  +- 1.53% (   7.61%)     111.05  +- 1.95% (  -1.15%)
      Amean  16        59.50  +- 1.28% (        )      55.61  +- 1.35% (   6.55%)      59.65  +- 1.78% (  -0.24%)
      Amean  32        34.94  +- 1.22% (        )      32.36  +- 1.95% (   7.41%)      35.44  +- 0.63% (  -1.43%)
      Amean  64        22.58  +- 0.38% (        )      20.97  +- 1.28% (   7.11%)      22.41  +- 1.73% (   0.74%)
      Amean  128       17.72  +- 0.44% (        )      16.68  +- 0.32% (   5.88%)      17.65  +- 0.96% (   0.37%)
      Amean  256       16.44  +- 0.53% (        )      15.76  +- 0.32% (   4.18%)      16.76  +- 0.60% (  -1.93%)
      Amean  512       16.54  +- 0.21% (        )      15.62  +- 0.41% (   5.53%)      16.84  +- 0.85% (  -1.83%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        421.30  +- 0.24% (  10.69%)     419.26  +- 0.15% (  11.12%)     414.38  +- 0.33% (  12.15%)
      Amean  4        217.81  +- 5.53% (  -2.80%)     211.63  +- 0.99% (   0.12%)     208.43  +- 0.47% (   1.63%)
      Amean  8        108.80  +- 0.43% (   0.90%)     108.48  +- 1.44% (   1.19%)     108.59  +- 3.08% (   1.09%)
      Amean  16        58.84  +- 0.74% (   1.12%)      58.37  +- 0.94% (   1.91%)      57.78  +- 0.78% (   2.90%)
      Amean  32        34.04  +- 2.00% (   2.59%)      34.28  +- 1.18% (   1.91%)      33.98  +- 2.21% (   2.75%)
      Amean  64        22.22  +- 1.69% (   1.60%)      22.27  +- 1.60% (   1.38%)      22.25  +- 1.41% (   1.47%)
      Amean  128       17.55  +- 0.24% (   0.97%)      17.53  +- 0.94% (   1.04%)      17.49  +- 0.43% (   1.30%)
      Amean  256       16.51  +- 0.46% (  -0.40%)      16.48  +- 0.48% (  -0.19%)      16.44  +- 1.21% (   0.00%)
      Amean  512       16.50  +- 0.35% (   0.19%)      16.35  +- 0.42% (   1.14%)      16.37  +- 0.33% (   0.99%)
      
      Benchmark          : gitsource (time to run the git unit test suite)
      Varying parameter  : none
      Unit               : seconds (lower is better)
      
                        5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean          1035.76  +- 0.30% (        )     688.21  +- 0.04% (  33.56%)    1003.85  +- 0.14% (   3.08%)
      
                                  5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean           995.82  +- 0.08% (   3.86%)    1011.98  +- 0.03% (   2.30%)     986.87  +- 0.19% (   4.72%)
      
      3. POWER CONSUMPTION TABLE
      ==========================
      
      Average power consumption (watts).
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                  ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      tbench4     227.25     281.83     244.17     236.76     241.50     247.99
      dbench4     151.97     161.87     157.08     158.10     158.06     153.73
      kernbench   162.78     167.22     162.90     164.19     164.65     164.72
      gitsource   133.65     139.00     133.04     134.43     134.18     134.32
      
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz
    • x86, sched: Calculate frequency invariance for AMD systems · 41ea6672
      Nathan Fontenot authored
      
      This is the first pass at creating the ability to calculate frequency
      invariance on AMD systems. This approach uses the CPPC highest performance
      and nominal performance values, which range from 0 - 255, instead of a
      highest and base frequency, because we do not have the ability on AMD to
      get a highest frequency value.
      
      On AMD systems the highest performance and nominal performance values do
      correspond to the highest and base frequencies for the system, so using
      them should produce an appropriate ratio, but some tweaking is likely
      necessary.
      
      Due to CPPC being initialized later in boot than when the frequency
      invariant calculation is currently made, I had to create a callback
      from the CPPC init code to do the calculation after we have CPPC
      data.
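
      A condensed sketch of that calculation (assuming the ACPI CPPC library's
      cppc_get_perf_caps() and the existing x86 arch_turbo_freq_ratio machinery;
      details differ in the actual patch):

        static bool amd_set_max_freq_ratio(void)
        {
                struct cppc_perf_caps caps;

                /* query CPU 0; this sketch assumes uniform CPPC caps across CPUs */
                if (cppc_get_perf_caps(0, &caps))
                        return false;

                /* highest_perf/nominal_perf (0-255) stand in for boost/base frequency */
                arch_turbo_freq_ratio = div_u64(caps.highest_perf * SCHED_CAPACITY_SCALE,
                                                caps.nominal_perf);
                arch_set_max_freq_ratio(false);
                return true;
        }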
      
      Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
      compilation of drivers/acpi/cppc_acpi.c is conditional on
      CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.
      
      [ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]
      
      Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
  2. Nov 23, 2020
    • s390: fix fpu restore in entry.S · 1179f170
      Sven Schnelle authored
      
      We need to disable interrupts in load_fpu_regs(). Otherwise an
      interrupt might come in after the registers are loaded, but before
      CIF_FPU is cleared in load_fpu_regs(). When the interrupt returns,
      CIF_FPU will be cleared and the registers will never be restored.
      
      The entry.S code usually saves the interrupt state in __SF_EMPTY on the
      stack when disabling/restoring interrupts. sie64a however saves the pointer
      to the sie control block in __SF_SIE_CONTROL, which references the same
      location.  This is non-obvious to the reader. To avoid thrashing the sie
      control block pointer in load_fpu_regs(), move the __SIE_* offsets eight
      bytes after __SF_EMPTY on the stack.
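
      The fix itself is in entry.S, but the ordering it enforces can be rendered
      in C-like form (purely illustrative; restore_fp_vx_regs() is a hypothetical
      stand-in for the actual register reload):

        unsigned long flags;

        local_irq_save(flags);          /* no interrupt may separate the two steps */
        restore_fp_vx_regs(current);    /* reload FP/VX registers from the task */
        clear_cpu_flag(CIF_FPU);        /* registers now match the CPU state */
        local_irq_restore(flags);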
      
      Cc: <stable@vger.kernel.org> # 5.8
      Fixes: 0b0ed657 ("s390: remove critical section cleanup from entry.S")
      Reported-by: Pierre Morel <pmorel@linux.ibm.com>
      Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
  3. Nov 19, 2020
    • powerpc/64s: rename pnv|pseries_setup_rfi_flush to _setup_security_mitigations · da631f7f
      Daniel Axtens authored
      
      pseries|pnv_setup_rfi_flush already does the count cache flush setup, and
      we just added entry and uaccess flushes, so the name is not very accurate
      any more. On both platforms we then also immediately set up the STF flush.
      
      Rename them to _setup_security_mitigations and fold the STF flush in.
      
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Only include kup-radix.h for 64-bit Book3S · 178d52c6
      Michael Ellerman authored
      
      In kup.h we currently include kup-radix.h for all 64-bit builds, which
      includes Book3S and Book3E. The latter doesn't make sense, as Book3E never
      uses the Radix MMU.
      
      This has worked up until now, but almost by accident, and the recent
      uaccess flush changes introduced a build breakage on Book3E because of
      the bad structure of the code.
      
      So disentangle things so that we only use kup-radix.h for Book3S. This
      requires some more stubs in kup.h and fixing an include in
      syscall_64.c.
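
      The shape of the fix in kup.h, as a sketch (the exact set of stubs depends
      on the callers; kuap_check_amr() is just one example of such a hook):

        #ifdef CONFIG_PPC_BOOK3S_64
        #include <asm/book3s/64/kup-radix.h>
        #else
        /* Book3E and 32-bit: empty stubs for the Book3S-only hooks */
        static inline void kuap_check_amr(void) { }
        #endif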
      
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: flush L1D after user accesses · 9a32a7e7
      Nicholas Piggin authored
      
      IBM Power9 processors can speculatively operate on data in the L1 cache
      before it has been completely validated, via a way-prediction mechanism. It
      is not possible for an attacker to determine the contents of impermissible
      memory using this method, since these systems implement a combination of
      hardware and software security measures to prevent scenarios where
      protected data could be leaked.
      
      However these measures don't address the scenario where an attacker induces
      the operating system to speculatively execute instructions using data that
      the attacker controls. This can be used for example to speculatively bypass
      "kernel user access prevention" techniques, as discovered by Anthony
      Steinhauser of Google's Safeside Project. This is not an attack by itself,
      but there is a possibility it could be used in conjunction with
      side-channels or other weaknesses in the privileged code to construct an
      attack.
      
      This issue can be mitigated by flushing the L1 cache between privilege
      boundaries of concern. This patch flushes the L1 cache after user accesses.
      
      This is part of the fix for CVE-2020-4788.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: flush L1D on kernel entry · f7964378
      Nicholas Piggin authored
      
      IBM Power9 processors can speculatively operate on data in the L1 cache
      before it has been completely validated, via a way-prediction mechanism. It
      is not possible for an attacker to determine the contents of impermissible
      memory using this method, since these systems implement a combination of
      hardware and software security measures to prevent scenarios where
      protected data could be leaked.
      
      However these measures don't address the scenario where an attacker induces
      the operating system to speculatively execute instructions using data that
      the attacker controls. This can be used for example to speculatively bypass
      "kernel user access prevention" techniques, as discovered by Anthony
      Steinhauser of Google's Safeside Project. This is not an attack by itself,
      but there is a possibility it could be used in conjunction with
      side-channels or other weaknesses in the privileged code to construct an
      attack.
      
      This issue can be mitigated by flushing the L1 cache between privilege
      boundaries of concern. This patch flushes the L1 cache on kernel entry.
      
      This is part of the fix for CVE-2020-4788.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • arm64: Rebuild sched domains on invariance status changes · ecec9e86
      Ionela Voinescu authored
      
      Task scheduler behavior depends on frequency invariance (FI) support and
      the resulting invariant load tracking signals. For example, in order to
      make accurate predictions across CPUs for all performance states, Energy
      Aware Scheduling (EAS) needs frequency-invariant load tracking signals
      and therefore it has a direct dependency on FI. This dependency is known,
      but EAS enablement is not yet conditioned on the presence of FI during
      the build of the scheduling domain hierarchy.
      
      Before this is done, the following must be considered: while
      arch_scale_freq_invariant() will see changes in FI support and could
      be used to condition the use of EAS, it could return different values
      during system initialisation.
      
      For arm64, such a scenario will happen for a system that does not support
      cpufreq driven FI, but does support counter-driven FI. For such a system,
      arch_scale_freq_invariant() will return false if called before counter-based
      FI initialisation, but true afterwards.
      If EAS becomes explicitly dependent on FI this would affect the task
      scheduler behavior which builds its scheduling domain hierarchy well
      before the late counter-based FI init. During that process, EAS would be
      disabled due to its dependency on FI.
      
      Two points of future early calls to arch_scale_freq_invariant() which
      would determine EAS enablement are:
       - (1) drivers/base/arch_topology.c:126 <<update_topology_flags_workfn>>
      		rebuild_sched_domains();
             This will happen after CPU capacity initialisation.
       - (2) kernel/sched/cpufreq_schedutil.c:917 <<rebuild_sd_workfn>>
      		rebuild_sched_domains_energy();
      		-->rebuild_sched_domains();
             This will happen during sched_cpufreq_governor_change() for the
             schedutil cpufreq governor.
      
      Therefore, before enforcing the presence of FI support for the use of EAS,
      ensure the following: if there is a change in FI support status after
      counter init, use the existing rebuild_sched_domains_energy() function to
      trigger a rebuild of the scheduling and performance domains that in turn
      will determine the enablement of EAS.
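
      A sketch of that check at the end of the counter-based FI init (condensed;
      rebuild_sched_domains_energy() is the existing helper named above):

        bool invariant = arch_scale_freq_invariant();

        /* ... counter-driven FI initialisation ... */

        /* if FI support changed state, rebuild the domains so EAS is re-evaluated */
        if (arch_scale_freq_invariant() != invariant)
                rebuild_sched_domains_energy();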
      
      Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lkml.kernel.org/r/20201027180713.7642-3-ionela.voinescu@arm.com
  4. Nov 16, 2020
    • xtensa: disable preemption around cache alias management calls · 3a860d16
      Max Filippov authored
      
      Although cache alias management calls set up and tear down TLB entries,
      and fast_second_level_miss is able to restore a TLB entry should it be
      evicted, they absolutely cannot preempt each other because they use the
      same TLBTEMP area for different purposes.
      Disable preemption around all cache alias management calls to enforce
      that.
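
      The pattern, sketched with hypothetical helper names (the real calls are
      the xtensa cache alias management functions):

        /* The per-CPU TLBTEMP window must not be reused by a preempting task
         * between mapping set-up and tear-down. */
        preempt_disable();
        map_page_at_tlbtemp(vaddr, paddr);      /* hypothetical helpers */
        do_cache_alias_maintenance(vaddr);
        unmap_tlbtemp(vaddr);
        preempt_enable();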
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
    • xtensa: fix TLBTEMP area placement · 481535c5
      Max Filippov authored
      
      The fast_second_level_miss handler for the TLBTEMP area assumes that the
      page table directory entry for the TLBTEMP address range is 0. For that to
      be true the TLBTEMP area must be aligned to a 4MB boundary and must not
      share its 4MB region with anything that may use a page table. This is
      currently not the case: TLBTEMP shares space with the vmalloc space, which
      results in the following kinds of runtime errors when fast_second_level_miss
      loads the page table directory entry for the vmalloc space instead of fixing
      up the TLBTEMP area:
      
       Unable to handle kernel paging request at virtual address c7ff0e00
        pc = d0009275, ra = 90009478
       Oops: sig: 9 [#1] PREEMPT
       CPU: 1 PID: 61 Comm: kworker/u9:2 Not tainted 5.10.0-rc3-next-20201110-00007-g1fe4962fa983-dirty #58
       Workqueue: xprtiod xs_stream_data_receive_workfn
       a00: 90009478 d11e1dc0 c7ff0e00 00000020 c7ff0000 00000001 7f8b8107 00000000
       a08: 900c5992 d11e1d90 d0cc88b8 5506e97c 00000000 5506e97c d06c8074 d11e1d90
       pc: d0009275, ps: 00060310, depc: 00000014, excvaddr: c7ff0e00
       lbeg: d0009275, lend: d0009287 lcount: 00000003, sar: 00000010
       Call Trace:
         xs_stream_data_receive_workfn+0x43c/0x770
         process_one_work+0x1a1/0x324
         worker_thread+0x1cc/0x3c0
         kthread+0x10d/0x124
         ret_from_kernel_thread+0xc/0x18
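
      One way to state the placement requirement above as a build-time check
      (illustrative only, assuming the existing TLBTEMP_BASE_1 symbol; the actual
      fix moves the TLBTEMP area so that the requirement holds):

        /* TLBTEMP must start on a 4MB boundary so its page directory entry stays 0 */
        static_assert(!(TLBTEMP_BASE_1 & (SZ_4M - 1)),
                      "TLBTEMP area must be 4MB aligned");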
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
  5. Nov 15, 2020
    • kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use · c887c9b9
      Paolo Bonzini authored
      
      In some cases where shadow paging is in use, the root page will be either
      mmu->pae_root or vcpu->arch.mmu->lm_root. Such a root will not have an
      associated struct kvm_mmu_page, because it is allocated with alloc_page
      instead of kvm_mmu_alloc_page.
      
      Just return false quickly from is_tdp_mmu_root if the TDP MMU is
      not in use, which also includes the case where shadow paging is
      enabled.
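
      A sketch of the early-out (condensed; field and helper names follow the
      5.10-era TDP MMU code, but check the actual patch for details):

        static bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
        {
                struct kvm_mmu_page *sp;

                /* With the TDP MMU disabled (including shadow paging), roots
                 * never belong to it, so bail out before dereferencing. */
                if (!kvm->arch.tdp_mmu_enabled)
                        return false;

                sp = to_shadow_page(hpa);
                return sp->tdp_mmu_page && sp->root_count;
        }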
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>