    x86/asm/64: Align start of __clear_user() loop to 16-bytes · bb5570ad
    Matt Fleming authored
    x86 CPUs can suffer severe performance drops if a tight loop, such as
    the ones in __clear_user(), straddles a 16-byte instruction fetch
    window, or worse, a 64-byte cacheline. This issue was discovered in the
    SUSE kernel with the following commit:
    
      11539337 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
    
    which increased the code object size from 10 bytes to 15 bytes and
    caused the 8-byte copy loop in __clear_user() to be split across a
    64-byte cacheline.
    
    Aligning the start of the loop to 16 bytes makes it fit neatly inside
    a single instruction fetch window again and restores the performance of
    __clear_user(), which is used heavily when reading from /dev/zero.
    
    Here are some numbers from running libmicro's read_z* and pread_z*
    microbenchmarks which read from /dev/zero:
    
      Zen 1 (Naples)
    
      libmicro-file
                                            5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                                        revert-11539337+               align16+
      Time mean95-pread_z100k       9.9195 (   0.00%)      5.9856 (  39.66%)      5.9938 (  39.58%)
      Time mean95-pread_z10k        1.1378 (   0.00%)      0.7450 (  34.52%)      0.7467 (  34.38%)
      Time mean95-pread_z1k         0.2623 (   0.00%)      0.2251 (  14.18%)      0.2252 (  14.15%)
      Time mean95-pread_zw100k      9.9974 (   0.00%)      6.0648 (  39.34%)      6.0756 (  39.23%)
      Time mean95-read_z100k        9.8940 (   0.00%)      5.9885 (  39.47%)      5.9994 (  39.36%)
      Time mean95-read_z10k         1.1394 (   0.00%)      0.7483 (  34.33%)      0.7482 (  34.33%)
    
    Note that this doesn't affect Haswell or Broadwell microarchitectures
    which seem to avoid the alignment issue by executing the loop straight
    out of the Loop Stream Detector (verified using perf events).
    
    Fixes: 11539337 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
    Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Cc: <stable@vger.kernel.org> # v4.19+
    Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@codeblueprint.co.uk