    sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages · 09a913a7
    Mel Gorman authored
    change_pte_range is called from task work context to mark PTEs for
    receiving NUMA hinting faults.  If the marked pages are dirty then
    migration may fail.  Some filesystems cannot migrate dirty pages without
    blocking, so those pages are skipped in MIGRATE_ASYNC mode, which just
    wastes CPU.  Even when they can, it can be a waste of cycles when the
    pages are shared, forcing higher scan rates.  This patch avoids marking
    shared dirty pages for hinting faults and also skips a migration if the
    page was dirtied after the scanner updated a clean page.
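
    The two checks described above can be modelled in userspace as a
    minimal sketch.  struct page_model and all three helper names are
    hypothetical and are not kernel interfaces.  The sketch assumes,
    based on the patch title, that the test is "file-backed and dirty".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's page state; only the bits needed here. */
struct page_model {
    bool file_backed;  /* page cache page rather than anonymous memory */
    bool dirty;        /* modified since the last writeback */
};

/* True for the kind of page the patch avoids: file-backed and dirty. */
static bool is_dirty_file_page(const struct page_model *page)
{
    assert(page != NULL);
    return page->file_backed && page->dirty;
}

/* Scan time: should the scanner mark this page's PTE for a hinting fault? */
static bool should_mark_for_hinting(const struct page_model *page)
{
    return !is_dirty_file_page(page);
}

/*
 * Fault time: the check is repeated because the page may have been
 * dirtied after the scanner marked it while it was still clean.
 */
static bool should_attempt_migration(const struct page_model *page)
{
    return !is_dirty_file_page(page);
}
```

    In this model a clean file page is marked at scan time, but if it
    is dirtied before the hinting fault arrives the migration attempt
    is skipped.  Dirty anonymous pages are unaffected.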
    
    This is most noticeable running the NAS Parallel Benchmarks when backed
    by btrfs, the default root filesystem for some distributions, but also
    noticeable when using XFS.
    
    The following are results from a 4-socket machine running a 4.16-rc4
    kernel with some scheduler patches that are pending for the next merge
    window.
    
                            4.16.0-rc4             4.16.0-rc4
                     schedtip-20180309          nodirty-v1
      Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
      Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
      Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
      Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
      Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)
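
    The parenthesised column appears to be the gain relative to the
    baseline kernel, so a negative value means the patched kernel was
    slower.  A minimal sketch of that arithmetic follows.  The helper
    name gain_percent is hypothetical and the formula is inferred from
    the numbers in the table, not taken from the reporting tool.

```c
#include <assert.h>

/* Relative gain of `patched` over `baseline` in percent; positive means faster. */
static double gain_percent(double baseline, double patched)
{
    assert(baseline > 0.0);
    return (baseline - patched) / baseline * 100.0;
}
```

    For example, gain_percent(459.07, 444.21) is roughly 3.24, which
    matches the cg.D row, and gain_percent(25.55, 27.85) is roughly
    -9.00, which matches the is.D row.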
    
    is.D regresses slightly in terms of absolute time, but note that this
    particular load varies quite a bit from run to run.  The more relevant
    observation is the total system CPU usage.
    
                     4.16.0-rc4    4.16.0-rc4
              schedtip-20180309    nodirty-v1
      User             71471.91      70627.04
      System           11078.96       8256.13
      Elapsed            661.66        632.74
    
    That is a substantial drop in system CPU usage and overall the workload
    completes faster.  The NUMA balancing statistics are also interesting:
    
      NUMA base PTE updates        111407972   139848884
      NUMA huge PMD updates           206506      264869
      NUMA page range updates      217139044   275461812
      NUMA hint faults               4300924     3719784
      NUMA hint local faults         3012539     3416618
      NUMA hint local percent             70          91
      NUMA pages migrated            1517487     1358420
    
    While more PTEs are scanned due to changes in what faults are gathered,
    it's clear that a far higher percentage of faults are local as the bulk
    of the remote hits were dirty pages that, in this case with btrfs, had
    no chance of migrating.
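
    The derived rows in the statistics above can be reproduced from the
    raw counters.  The following is a minimal sketch with hypothetical
    helper names.  It assumes one huge PMD covers 512 base pages (4K
    base pages with 2M huge pages, as on x86-64) and that the local
    percentage is truncated to an integer.  Both assumptions agree with
    the figures shown but are not stated in the changelog.

```c
#include <assert.h>

/* Assumption: 2M huge pages over 4K base pages, so one PMD spans 512 PTEs. */
#define PTES_PER_HUGE_PMD 512ULL

/* Page range updates: base PTE updates plus huge PMD updates in base-page units. */
static unsigned long long page_range_updates(unsigned long long pte_updates,
                                             unsigned long long pmd_updates)
{
    return pte_updates + pmd_updates * PTES_PER_HUGE_PMD;
}

/* Share of hinting faults that were node-local, truncated to a whole percent. */
static unsigned long long local_percent(unsigned long long local_faults,
                                        unsigned long long hint_faults)
{
    assert(hint_faults > 0);
    return local_faults * 100ULL / hint_faults;
}
```

    With the baseline counters, page_range_updates(111407972, 206506)
    gives 217139044 and local_percent(3012539, 4300924) gives 70.  With
    the patched counters the results are 275461812 and 91.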
    
    The following is a comparison when using XFS, as that is a more realistic
    filesystem choice for a data partition:
    
                            4.16.0-rc4             4.16.0-rc4
                     schedtip-20180309          nodirty-v1r47
      Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
      Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
      Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
      Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
      Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)
    
    That is a reasonable gain on two relatively long-lived workloads.  While
    not presented, there is also a substantial drop in system CPU usage and
    the NUMA balancing stats show improvements in locality similar to those
    seen with btrfs.
    
    Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
    
    
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>