    mm, kswapd: replace kswapd compaction with waking up kcompactd · accf6242
    Vlastimil Babka authored
    
    
    Similarly to direct reclaim/compaction, kswapd attempts to combine
    reclaim and compaction to attempt making memory allocation of given
    order available.
    
    The details differ from direct reclaim e.g. in having high watermark as
    a goal.  The code involved in kswapd's reclaim/compaction decisions has
    evolved to be quite complex.
    
    Testing reveals that it doesn't actually work in at least one scenario,
    and closer inspection suggests that it could be greatly simplified
    without compromising on the goal (make high-order page available) or
    efficiency (don't reclaim too much).  The simplification relies on
    doing all compaction in kcompactd, which is simply woken up when high
    watermarks are reached by kswapd's reclaim.
    
    The scenario where kswapd compaction doesn't work was found with mmtests
    test stress-highalloc configured to attempt order-9 allocations without
    direct reclaim, just waking up kswapd.  There was no compaction attempt
    from kswapd during the whole test.  Some added instrumentation shows
    what happens:
    
     - balance_pgdat() sets end_zone to Normal, as it's not balanced
     - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
       it cannot reclaim anything, so sc.nr_reclaimed is 0
     - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
       it merely checks if high watermarks were reached for base pages.
       This is true, so no reclaim is attempted.  For DMA, testorder=0
       wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
     - even though the pgdat_needs_compaction flag wasn't set to false, no
       compaction happens due to the condition sc.nr_reclaimed >
       nr_attempted being false (as 0 < 99)
     - priority-- due to nr_reclaimed being 0, repeat until priority reaches
       0; pgdat_balanced() is false as only the small zone DMA appears
       balanced (curiously in that check, watermark appears OK and
       compaction_suitable() returns COMPACT_PARTIAL, because a lower
       classzone_idx is used there)
    
    Now, even if it was decided that reclaim shouldn't be attempted on the
    DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
    nr_attempted=0) is also false.  The condition really should use >= as
    the comment suggests.  Then there is a mismatch in the check for setting
    pgdat_needs_compaction to false using low watermark, while the rest uses
    high watermark, and who knows what other subtlety.  Hopefully this
    demonstrates that this is unsustainable.
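
    The failing condition can be reduced to a minimal sketch.  This is a
    simplified model, not the kernel code: the function names are made up
    for illustration, and the real balance_pgdat() has many more inputs.
    It only shows why neither (0, 99) nor (0, 0) lets compaction run with
    the strict comparison:

```c
#include <stdbool.h>

/*
 * Simplified model of the pre-patch decision at the end of a
 * balance_pgdat() pass.  Parameter names mirror the commit message;
 * the function itself is hypothetical.
 */
bool old_kswapd_would_compact(bool pgdat_needs_compaction,
                              unsigned long nr_reclaimed,
                              unsigned long nr_attempted)
{
    /* Strict '>' as described in the text above. */
    return pgdat_needs_compaction && nr_reclaimed > nr_attempted;
}

/*
 * Hypothetical variant using '>=', which is what the comment in the
 * old code suggested.  Shown for comparison only.
 */
bool ge_variant_would_compact(bool pgdat_needs_compaction,
                              unsigned long nr_reclaimed,
                              unsigned long nr_attempted)
{
    return pgdat_needs_compaction && nr_reclaimed >= nr_attempted;
}
```

    In this model, old_kswapd_would_compact(true, 0, 99) and
    old_kswapd_would_compact(true, 0, 0) are both false, while the '>='
    variant only helps in the (0, 0) case.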
    
    Luckily we can simplify this a lot.  The reclaim/compaction decisions
    make sense for direct reclaim scenario, but in kswapd, our primary goal
    is to reach high watermark in order-0 pages.  Afterwards we can attempt
    compaction just once.  Unlike direct reclaim, we don't reclaim extra
    pages (over the high watermark); the current code already disallows it
    for good reasons.
    
    After this patch, we simply wake up kcompactd to process the pgdat,
    after we have either succeeded or failed to reach the high watermarks in
    kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
    so kcompactd can apply the same criteria to determine which zones are
    worth compacting.  Note that we use the classzone_idx from
    wakeup_kswapd(), not balanced_classzone_idx which can include higher
    zones that kswapd tried to balance too, but didn't consider them in
    pgdat_balanced().
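
    A rough sketch of that handoff, assuming kcompactd considers zones
    from index 0 up to the classzone_idx it was given.  The struct, the
    function names and the three-zone layout are assumptions made for
    illustration; the real wakeup path also deals with wait queues and
    merging of concurrent requests, which is omitted here:

```c
#include <stdbool.h>

#define NR_ZONES_MODEL 3 /* assumption: DMA, DMA32, Normal */

/* Hypothetical record of what kswapd hands over to kcompactd. */
struct kcompactd_request {
    int order;
    int classzone_idx;
    bool pending;
};

/* Model of the wakeup: nothing to do for order-0 requests. */
void wakeup_kcompactd_model(struct kcompactd_request *req,
                            int order, int classzone_idx)
{
    if (order == 0)
        return;
    req->order = order;
    req->classzone_idx = classzone_idx;
    req->pending = true;
}

/* Bitmask of zone indexes kcompactd would look at for this request. */
unsigned int kcompactd_zones_considered(const struct kcompactd_request *req)
{
    unsigned int mask = 0;

    if (!req->pending)
        return 0;
    for (int zid = 0; zid <= req->classzone_idx && zid < NR_ZONES_MODEL; zid++)
        mask |= 1u << zid;
    return mask;
}
```

    Passing a classzone_idx of 1 instead of 2 leaves the highest zone out
    of the mask, which is the effect of using the index from
    wakeup_kswapd() rather than balanced_classzone_idx.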
    
    Since kswapd now cannot create high-order pages itself, we need to
    adjust how it determines the zones to be balanced.  The key element here
    is adding a "highorder" parameter to zone_balanced, which, when set to
    false, makes it consider only order-0 watermark instead of the desired
    higher order (this was done previously by kswapd_shrink_zone(), but not
    elsewhere).  This false is passed for example in pgdat_balanced().
    Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
    kcompactd are woken up for a high-order allocation failure.
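
    The effect of the "highorder" parameter can be illustrated with a toy
    model.  Everything below is an assumption for illustration (the
    struct, the helper names, and the simplified watermark check, which
    ignores lowmem reserves and other details of the real
    implementation); only the idea of falling back to order 0 when
    highorder is false comes from the description above:

```c
#include <stdbool.h>

#define MAX_ORDER_MODEL 11 /* assumption: orders 0..10 */

/* Toy zone: number of free blocks per order and a high watermark in pages. */
struct zone_model {
    unsigned long free_blocks[MAX_ORDER_MODEL];
    unsigned long high_wmark;
};

/* Total free pages across all orders. */
unsigned long zone_free_pages(const struct zone_model *z)
{
    unsigned long pages = 0;

    for (int o = 0; o < MAX_ORDER_MODEL; o++)
        pages += z->free_blocks[o] << o;
    return pages;
}

/*
 * Toy watermark check: enough free pages overall and, for order > 0,
 * at least one free block of the requested order or larger.
 */
bool watermark_ok_model(const struct zone_model *z, int order,
                        unsigned long mark)
{
    if (zone_free_pages(z) < mark)
        return false;
    if (order == 0)
        return true;
    for (int o = order; o < MAX_ORDER_MODEL; o++)
        if (z->free_blocks[o] > 0)
            return true;
    return false;
}

/* With highorder == false, only the order-0 watermark is considered. */
bool zone_balanced_model(const struct zone_model *z, int order, bool highorder)
{
    if (!highorder)
        order = 0;
    return watermark_ok_model(z, order, z->high_wmark);
}
```

    For a zone with plenty of order-0 pages but no free order-9 block,
    the model reports balanced with highorder=false (so kswapd can go to
    sleep) and unbalanced with highorder=true (so a wakeup still
    happens).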
    
    The last thing is to decide what to do with pageblock_skip bitmap
    handling.  Compaction maintains a pageblock_skip bitmap to record
    pageblocks where isolation recently failed.  This bitmap can be reset in
    three ways:
    
    1) direct compaction is restarting after going through the full deferred cycle
    
    2) kswapd goes to sleep, and some other direct compaction has previously
       finished scanning the whole zone and set zone->compact_blockskip_flush.
       Note that a successful direct compaction clears this flag.
    
    3) compaction was invoked manually via trigger in /proc
    
    The case 2) is somewhat fuzzy to begin with, but after introducing
    kcompactd we should update it.  The check for direct compaction in 1),
    and to set the flush flag in 2) use current_is_kswapd(), which doesn't
    work for kcompactd.  Thus, this patch adds bool direct_compaction to
    compact_control to use in 2).  For the case 1) we remove the check
    completely - unlike the former kswapd compaction, kcompactd does use the
    deferred compaction functionality, so flushing tied to restarting from
    deferred compaction makes sense here.
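
    The resulting rules for cases 1) and 2) can be sketched as a small
    state model.  The struct and function names here are hypothetical
    and the bitmap is collapsed into a single bool; only the
    direct_compaction field and the compact_blockskip_flush flag are
    taken from the description above:

```c
#include <stdbool.h>

/* Hypothetical stand-in for compact_control. */
struct compact_control_model {
    int order;
    bool direct_compaction; /* field added by this patch */
};

/* Hypothetical per-zone state; the skip bitmap is reduced to one bool. */
struct zone_state_model {
    bool compact_blockskip_flush;
    bool skip_bits_set;
};

/* Case 1): any compaction restarting after the deferred cycle resets the bits. */
void compaction_starting(struct zone_state_model *z, bool restarting)
{
    if (restarting)
        z->skip_bits_set = false;
}

/* Case 2), setting side: only direct compaction requests a later flush. */
void full_scan_finished(struct zone_state_model *z,
                        const struct compact_control_model *cc)
{
    if (cc->direct_compaction)
        z->compact_blockskip_flush = true;
}

/* Case 2), flushing side: kswapd is about to go to sleep. */
void kswapd_going_to_sleep(struct zone_state_model *z)
{
    if (z->compact_blockskip_flush) {
        z->skip_bits_set = false;
        z->compact_blockskip_flush = false;
    }
}
```

    In this model a full scan by kcompactd (direct_compaction == false)
    never arms the flush flag, while a full scan by a direct compactor
    does.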
    
    Note that when kswapd goes to sleep, kcompactd is woken up, so it will
    see the flushed pageblock_skip bits.  This is different from when the
    former kswapd compaction observed the bits and I believe it makes more
    sense.  Kcompactd can afford to be more thorough than a direct
    compaction trying to limit allocation latency, or kswapd whose primary
    goal is to reclaim.
    
    For testing, I used stress-highalloc configured to do order-9
    allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
    on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
    phases 1 and 2 work as usual):
    
    stress-highalloc
                            4.5-rc1+before          4.5-rc1+after
                                 -nodirect              -nodirect
    Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
    Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
    Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
    Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
    Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
    Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
    Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
    Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
    Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
    
    User                          3166.67        3181.09
    System                        1153.37        1158.25
    Elapsed                       1768.53        1799.37
    
                                4.5-rc1+before   4.5-rc1+after
                                     -nodirect    -nodirect
    Direct pages scanned                32938        32797
    Kswapd pages scanned              2183166      2202613
    Kswapd pages reclaimed            2152359      2143524
    Direct pages reclaimed              32735        32545
    Percentage direct scans                1%           1%
    THP fault alloc                       579          612
    THP collapse alloc                    304          316
    THP splits                              0            0
    THP fault fallback                    793          778
    THP collapse fail                      11           16
    Compaction stalls                    1013         1007
    Compaction success                     92           67
    Compaction failures                   920          939
    Page migrate success               238457       721374
    Page migrate failure                23021        23469
    Compaction pages isolated          504695      1479924
    Compaction migrate scanned         661390      8812554
    Compaction free scanned          13476658     84327916
    Compaction cost                       262          838
    
    After this patch we see improvements in allocation success rate
    (especially for phase 3) along with increased compaction activity.  The
    compaction stalls (direct compaction) in the interfering kernel builds
    (probably THP's) also decreased somewhat thanks to kcompactd activity,
    yet THP alloc successes improved a bit.
    
    Note that elapsed and user time aren't so useful for this benchmark,
    because of the background interference being unpredictable.  It's just
    to quickly spot some major unexpected differences.  System time is
    somewhat more useful and that didn't increase.
    
    Also (after adjusting mmtests' ftrace monitor):
    
    Time kswapd awake               2547781     2269241
    Time kcompactd awake                  0      119253
    Time direct compacting           939937      557649
    Time kswapd compacting                0           0
    Time kcompactd compacting             0      119099
    
    The decrease of overall time spent compacting appears to not match the
    increased compaction stats.  I suspect the tasks get rescheduled and
    since the ftrace monitor doesn't see that, the reported time is wall
    time, not CPU time.  But arguably direct compactors care about overall
    latency anyway; whether busy compacting or waiting for CPU doesn't
    matter.  And that latency seems to have almost halved.
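
    As a quick check of the arithmetic, using only the "Time direct
    compacting" and "Time kcompactd compacting" rows above (the helper
    name is made up for illustration):

```c
/* Relative reduction, in percent, going from 'before' to 'after'. */
double reduction_percent(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

    With the figures above, reduction_percent(939937, 557649) comes out
    at roughly 41% for direct compaction alone, and
    reduction_percent(939937, 557649 + 119099) at roughly 28% when
    kcompactd's compacting time is counted as well.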
    
    It's also interesting how much time kswapd spent awake just going
    through all the priorities and failing to even try compacting, over and
    over.
    
    We can also configure stress-highalloc to perform both direct
    reclaim/compaction and wake up kswapd/kcompactd, by using
    GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
    
    stress-highalloc
                            4.5-rc1+before         4.5-rc1+after
                                   -direct               -direct
    Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
    Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
    Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
    Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
    Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
    Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
    Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
    Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
    Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
    
    User                          3344.73       3246.04
    System                        1194.24       1172.29
    Elapsed                       1838.04       1836.76
    
                                4.5-rc1+before  4.5-rc1+after
                                       -direct     -direct
    Direct pages scanned               125146      120966
    Kswapd pages scanned              2119757     2135012
    Kswapd pages reclaimed            2073183     2108388
    Direct pages reclaimed             124909      120577
    Percentage direct scans                5%          5%
    THP fault alloc                       599         652
    THP collapse alloc                    323         354
    THP splits                              0           0
    THP fault fallback                    806         793
    THP collapse fail                      17          16
    Compaction stalls                    2457        2025
    Compaction success                    906         518
    Compaction failures                  1551        1507
    Page migrate success              2031423     2360608
    Page migrate failure                32845       40852
    Compaction pages isolated         4129761     4802025
    Compaction migrate scanned       11996712    21750613
    Compaction free scanned         214970969   344372001
    Compaction cost                      2271        2694
    
    In this scenario, this patch doesn't change the overall success rate as
    direct compaction already tries all it can.  There's however a significant
    reduction in direct compaction stalls (that is, the number of
    allocations that went into direct compaction).  The number of successes
    (i.e.  direct compaction stalls that ended up with successful
    allocation) is reduced by the same number.  This means the offload to
    kcompactd is working as expected, and direct compaction is reduced
    either due to detecting contention, or compaction deferred by kcompactd.
    In the previous version of this patchset there was some apparent
    reduction of success rate, but the changes in this version (such as
    using sync compaction only), new baseline kernel, and/or averaging
    results from 5 executions (my bet), made this go away.
    
    Ftrace-based stats seem to roughly agree:
    
    Time kswapd awake               2532984     2326824
    Time kcompactd awake                  0      257916
    Time direct compacting           864839      735130
    Time kswapd compacting                0           0
    Time kcompactd compacting             0      257585
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>