    sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS · 32e839dd
    Mel Gorman authored
    The select_idle_sibling() (SIS) rewrite in commit:
    
      10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
    
    ... replaced a domain iteration with a search that, broadly speaking,
    does a wrapped walk of the scheduler domain sharing a last-level-cache.
    
    While this had a number of improvements, one consequence is that two tasks
    that share a waker/wakee relationship push each other around a socket. Even
    though two tasks may be active, all cores are evenly used. This is great from
    a search perspective and spreads a load across individual cores, but it has
    adverse consequences for cpufreq. As each CPU has relatively low utilisation,
    cpufreq may decide the utilisation is too low to use a higher P-state and
    overall computation throughput suffers.
    
    While individual cpufreq and cpuidle drivers may compensate by artificially
    boosting P-state (at C0) or avoiding lower C-states (during idle), it does
    not help if hardware-based cpufreq (e.g. HWP) is used.
    
    This patch tracks a recently used CPU: the CPU a task was running on when
    it was last a waker, or a CPU it was recently using when it is a wakee.
    During SIS, the recently used CPU is used as a target if it's still
    allowed by the task and is idle.
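
The check described above can be sketched in userspace C. This is a simplified model and not the kernel implementation: `struct task_model`, `cpu_in_mask()` and `pick_wake_cpu()` are hypothetical names, CPU masks are plain 64-bit integers, and only the two conditions stated in this changelog (the CPU is still allowed and is idle) are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct task_model {
    int recent_used_cpu;    /* CPU remembered from an earlier wakeup, -1 if none */
    uint64_t cpus_allowed;  /* bit n set => the task may run on CPU n */
};

/* True if cpu is a valid index and its bit is set in mask. */
static bool cpu_in_mask(uint64_t mask, int cpu)
{
    return cpu >= 0 && cpu < 64 && ((mask >> cpu) & 1u) != 0;
}

/*
 * Prefer the recently used CPU when it is both allowed for the task
 * and currently idle; otherwise return the fallback, which stands in
 * for whatever the regular idle-sibling search would have picked.
 */
static int pick_wake_cpu(const struct task_model *p, uint64_t idle_mask,
                         int fallback)
{
    int cpu = p->recent_used_cpu;

    if (cpu_in_mask(p->cpus_allowed, cpu) && cpu_in_mask(idle_mask, cpu))
        return cpu;
    return fallback;
}
```

With `recent_used_cpu = 0`, an allowed mask of `0xF` and CPU 0 idle, `pick_wake_cpu()` returns 0. If CPU 0 is busy or no longer allowed, the fallback is returned instead.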
    
    The benefit may be non-obvious so consider an example of two tasks
    communicating back and forth. Task A may be an application doing IO where
    task B is a kworker or kthread like journald. Task A may issue IO, wake
    B and B wakes up A on completion.  With the existing scheme this may look
    like the following (potentially different IDs if SMT is in use but a similar
    principle applies).
    
     A (cpu 0)	wake	B (wakes on cpu 1)
     B (cpu 1)	wake	A (wakes on cpu 2)
     A (cpu 2)	wake	B (wakes on cpu 3)
     etc.
    
    A careful reader may wonder why CPU 0 was not idle when B wakes A the
    first time. It is simply because A can be rescheduled to another CPU, and
    the pattern is that prev == target when B tries to wake up A, so the
    information about CPU 0 has been lost.
    
    With this patch, the pattern is more likely to be:
    
     A (cpu 0)	wake	B (wakes on cpu 1)
     B (cpu 1)	wake	A (wakes on cpu 0)
     A (cpu 0)	wake	B (wakes on cpu 1)
     etc.
    
    i.e. two communicating tasks are more likely to use just two cores instead
    of all available cores sharing an LLC.
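
The two patterns can be reproduced with a toy simulation. The sketch below assumes only the two tasks are runnable (so every CPU other than the waker's is idle) and models the existing search as "next CPU after the waker, wrapping around". Both are simplifications, and `count_cpus_used()` is a hypothetical helper, not kernel code.

```c
#include <stdbool.h>

/*
 * Simulate two tasks waking each other and count how many distinct
 * CPUs end up being used. Task 0 (A) starts on CPU 0.
 *
 * use_recent == false: the wakee lands on the CPU after the waker's.
 * use_recent == true:  the wakee returns to the CPU it last used,
 *                      provided the waker is not occupying it.
 *
 * Returns -1 if ncpus is outside 1..64.
 */
static int count_cpus_used(int ncpus, int wakeups, bool use_recent)
{
    bool used[64] = { false };
    int recent[2] = { 0, -1 };  /* last CPU of task A and task B */
    int waker = 0, waker_cpu = 0, count = 0;

    if (ncpus < 1 || ncpus > 64)
        return -1;

    used[0] = true;
    for (int i = 0; i < wakeups; i++) {
        int wakee = 1 - waker;
        int cpu;

        if (use_recent && recent[wakee] >= 0 && recent[wakee] != waker_cpu)
            cpu = recent[wakee];            /* reuse the remembered CPU */
        else
            cpu = (waker_cpu + 1) % ncpus;  /* toy wrapped walk */

        used[cpu] = true;
        recent[wakee] = cpu;
        waker = wakee;
        waker_cpu = cpu;
    }

    for (int c = 0; c < ncpus; c++)
        count += used[c];
    return count;
}
```

Under these assumptions, six wakeups on a four-CPU LLC touch all four CPUs with `use_recent` false and only two with it true, which matches the two diagrams.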
    
    The most dramatic speedup was noticed on dbench using the XFS filesystem on
    UMA as clients interact heavily with workqueues in that configuration. Note
    that a similar speedup is not observed on ext4 as the wakeup pattern
    is different:
    
                              4.15.0-rc9             4.15.0-rc9
                               waprev-v1        biasancestor-v1
     Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
     Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
     Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
     Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
     Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)
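
The bracketed figures are the relative change against the baseline column. A minimal sketch of that calculation, assuming the usual (new - base) / base formula, which reproduces the values in the table (the helper name `pct_change()` is hypothetical):

```c
/*
 * Relative change of new_val against base, in percent.
 * Positive means new_val is higher than base.
 * The caller must pass a non-zero base.
 */
static double pct_change(double base, double new_val)
{
    return (new_val - base) / base * 100.0;
}
```

For the single-client row, `pct_change(287.54, 817.01)` is roughly 184.14, and for four clients `pct_change(1739.68, 1594.47)` is roughly -8.35.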
    
    The results can be less dramatic on NUMA where automatic balancing interferes
    with the test. Network benchmarks running on localhost are also known to
    benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
    and TCP depending on the machine). Hackbench also sees small improvements
    (6-11% depending on machine and thread count). The Facebook schbench was also
    tested but in most cases showed little or no difference in wakeup latencies.
    
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matt Fleming <matt@codeblueprint.co.uk>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
    
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>