1. 25 Feb, 2017 1 commit
  2. 23 Feb, 2017 1 commit
    • David Rientjes's avatar
      mm, thp: add new defer+madvise defrag option · 21440d7e
      David Rientjes authored
      There is no thp defrag option that currently allows MADV_HUGEPAGE
      regions to do direct compaction and reclaim while all other thp
      allocations simply trigger kswapd and kcompactd in the background and
      fail immediately.
      
      The "defer" setting simply triggers background reclaim and compaction
      for all regions, regardless of MADV_HUGEPAGE, which makes it unusable
      for our userspace where MADV_HUGEPAGE is being used to indicate the
      application is willing to wait for work for thp memory to be available.
      
      The "madvise" setting will do direct compaction and reclaim for these
      MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the
      background for anybody else.
      
      For reasonable usage, there needs to be a mesh between the two options.
      This patch introduces a fifth mode, "defer+madvise", that will do direct
      reclaim and compaction for MADV_HUGEPAGE regions and trigger background
      reclaim and compaction for everybody else so that hugepages may be
      available in the near future.
      
      A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE
      regions as part of the "defer" mode, making it a very powerful setting
      and avoids breaking userspace, was offered:
           http://marc.info/?t=148236612700003
      This additional mode is a compromise.
      
      A second proposal to allow both "defer" and "madvise" to be selected at
      the same time was also offered:
           http://marc.info/?t=148357345300001.
      This is possible, but there was a concern that it might break existing
      userspaces the parse the output of the defrag mode, so the fifth option
      was introduced instead.
      
      This patch also cleans up the helper function for storing to "enabled"
      and "defrag" since the former supports three modes while the latter
      supports five and triple_flag_store() was getting unnecessarily messy.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701101614330.41805@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21440d7e
  3. 15 Dec, 2016 1 commit
  4. 13 Dec, 2016 1 commit
  5. 17 Nov, 2016 1 commit
    • Aaron Lu's avatar
      mremap: fix race between mremap() and page cleanning · 5d190420
      Aaron Lu authored
      Prior to 3.15, there was a race between zap_pte_range() and
      page_mkclean() where writes to a page could be lost.  Dave Hansen
      discovered by inspection that there is a similar race between
      move_ptes() and page_mkclean().
      
      We've been able to reproduce the issue by enlarging the race window with
      a msleep(), but have not been able to hit it without modifying the code.
      So, we think it's a real issue, but is difficult or impossible to hit in
      practice.
      
      The zap_pte_range() issue is fixed by commit 1cf35d47("mm: split
      'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  And
      this patch is to fix the race between page_mkclean() and mremap().
      
      Here is one possible way to hit the race: suppose a process mmapped a
      file with READ | WRITE and SHARED, it has two threads and they are bound
      to 2 different CPUs, e.g.  CPU1 and CPU2.  mmap returned X, then thread
      1 did a write to addr X so that CPU1 now has a writable TLB for addr X
      on it.  Thread 2 starts mremaping from addr X to Y while thread 1
      cleaned the page and then did another write to the old addr X again.
      The 2nd write from thread 1 could succeed but the value will get lost.
      
              thread 1                           thread 2
           (bound to CPU1)                    (bound to CPU2)
      
        1: write 1 to addr X to get a
           writeable TLB on this CPU
      
                                              2: mremap starts
      
                                              3: move_ptes emptied PTE for addr X
                                                 and setup new PTE for addr Y and
                                                 then dropped PTL for X and Y
      
        4: page laundering for N by doing
           fadvise FADV_DONTNEED. When done,
           pageframe N is deemed clean.
      
        5: *write 2 to addr X
      
                                              6: tlb flush for addr X
      
        7: munmap (Y, pagesize) to make the
           page unmapped
      
        8: fadvise with FADV_DONTNEED again
           to kick the page off the pagecache
      
        9: pread the page from file to verify
           the value. If 1 is there, it means
           we have lost the written 2.
      
        *the write may or may not cause segmentation fault, it depends on
        if the TLB is still on the CPU.
      
      Please note that this is only one specific way of how the race could
      occur, it didn't mean that the race could only occur in exact the above
      config, e.g. more than 2 threads could be involved and fadvise() could
      be done in another thread, etc.
      
      For anonymous pages, they could race between mremap() and page reclaim:
      THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
      PMD gets unmapped/splitted/pagedout before the flush tlb happened for
      the old huge PMD in move_page_tables() and we could still write data to
      it.  The normal anonymous page has similar situation.
      
      To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
      if any, did the flush before dropping the PTL.  If we did the flush for
      every move_ptes()/move_huge_pmd() call then we do not need to do the
      flush in move_pages_tables() for the whole range.  But if we didn't, we
      still need to do the whole range flush.
      
      Alternatively, we can track which part of the range is flushed in
      move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
      range in move_page_tables().  But that would require multiple tlb
      flushes for the different sub-ranges and should be less efficient than
      the single whole range flush.
      
      KBuild test on my Sandybridge desktop doesn't show any noticeable change.
      v4.9-rc4:
        real    5m14.048s
        user    32m19.800s
        sys     4m50.320s
      
      With this commit:
        real    5m13.888s
        user    32m19.330s
        sys     4m51.200s
      Reported-by: default avatarDave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d190420
  6. 08 Oct, 2016 2 commits
    • Aaron Lu's avatar
      thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu authored
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_user of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_user, the kthread is not eligible to use huge
      zero page either.  Since no kthread is using huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.comSigned-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
    • Toshi Kani's avatar
      thp, dax: add thp_get_unmapped_area for pmd mappings · 74d2fad1
      Toshi Kani authored
      When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
      This feature relies on both mmap virtual address and FS block (i.e.
      physical address) to be aligned by the pmd page size.  Users can use
      mkfs options to specify FS to align block allocations.  However,
      aligning mmap address requires code changes to existing applications for
      providing a pmd-aligned address to mmap().
      
      For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
      It calls mmap() with a NULL address, which needs to be changed to
      provide a pmd-aligned address for testing with DAX pmd mappings.
      Changing all applications that call mmap() with NULL is undesirable.
      
      Add thp_get_unmapped_area(), which can be called by filesystem's
      get_unmapped_area to align an mmap address by the pmd size for a DAX
      file.  It calls the default handler, mm->get_unmapped_area(), to find a
      range and then aligns it for a DAX file.
      
      The patch is based on Matthew Wilcox's change that allows adding support
      of the pud page size easily.
      
      [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
      Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.comSigned-off-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      74d2fad1
  7. 28 Jul, 2016 1 commit
  8. 26 Jul, 2016 5 commits
  9. 15 Jul, 2016 1 commit
    • Naoya Horiguchi's avatar
      mm: thp: move pmd check inside ptl for freeze_page() · 33f4751e
      Naoya Horiguchi authored
      I found a race condition triggering VM_BUG_ON() in freeze_page(), when
      running a testcase with 3 processes:
        - process 1: keep writing thp,
        - process 2: keep clearing soft-dirty bits from virtual address of process 1
        - process 3: call migratepages for process 1,
      
      The kernel message is like this:
      
        kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci virtio_ring virtio
        CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff880037320000 ti: ffff88007cdd0000 task.ti: ffff88007cdd0000
        RIP: 0010:[<ffffffff811f8e06>]  [<ffffffff811f8e06>] split_huge_page_to_list+0x496/0x590
        RSP: 0018:ffff88007cdd3b70  EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff88007c7b88c0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000700000200 RDI: ffffea0003188000
        RBP: ffff88007cdd3bb8 R08: 0000000000000001 R09: 00003ffffffff000
        R10: ffff880000000000 R11: ffffc000001fffff R12: ffffea0003188000
        R13: ffffea0003188000 R14: 0000000000000000 R15: 0400000000000080
        FS:  00007f8ec241d740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000             CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f8ec1f3ed20 CR3: 000000003707b000 CR4: 00000000000006f0
        Call Trace:
          ? list_del+0xd/0x30
          queue_pages_pte_range+0x4d1/0x590
          __walk_page_range+0x204/0x4e0
          walk_page_range+0x71/0xf0
          queue_pages_range+0x75/0x90
          ? queue_pages_hugetlb+0x190/0x190
          ? new_node_page+0xc0/0xc0
          ? change_prot_numa+0x40/0x40
          migrate_to_node+0x71/0xd0
          do_migrate_pages+0x1c3/0x210
          SyS_migrate_pages+0x261/0x290
          entry_SYSCALL_64_fastpath+0x1a/0xa4
        Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
        RIP   split_huge_page_to_list+0x496/0x590
      
      I'm not sure of the full scenario of the reproduction, but my debug
      showed that split_huge_pmd_address(freeze=true) returned without running
      main code of pmd splitting because pmd_present(*pmd) in precheck somehow
      returned 0.  If this happens, the subsequent try_to_unmap() fails and
      returns non-zero (because page_mapcount() still > 0), and finally
      VM_BUG_ON() fires.  This patch tries to fix it by prechecking pmd state
      inside ptl.
      
      Link: http://lkml.kernel.org/r/1466990929-7452-1-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33f4751e
  10. 20 May, 2016 1 commit
    • Hugh Dickins's avatar
      huge mm: move_huge_pmd does not need new_vma · bf8616d5
      Hugh Dickins authored
      Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
      a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
      the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
      it was in the old vma, alignment and size permitting.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf8616d5
  11. 29 Apr, 2016 1 commit
  12. 01 Apr, 2016 1 commit
  13. 17 Mar, 2016 3 commits
    • Kirill A. Shutemov's avatar
      thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers · fec89c10
      Kirill A. Shutemov authored
      freeze_page() and unfreeze_page() helpers evolved in rather complex
      beasts.  It would be nice to cut complexity of this code.
      
      This patch rewrites freeze_page() using standard try_to_unmap().
      unfreeze_page() is rewritten with remove_migration_ptes().
      
      The result is much simpler.
      
      But the new variant is somewhat slower for PTE-mapped THPs.  Current
      helpers iterates over VMAs the compound page is mapped to, and then over
      ptes within this VMA.  New helpers iterates over small page, then over
      VMA the small page mapped to, and only then find relevant pte.
      
      We have short cut for PMD-mapped THP: we directly install migration
      entries on PMD split.
      
      I don't think the slowdown is critical, considering how much simpler
      result is and that split_huge_page() is quite rare nowadays.  It only
      happens due memory pressure or migration.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fec89c10
    • Kirill A. Shutemov's avatar
      rmap: extend try_to_unmap() to be usable by split_huge_page() · 2a52bcbc
      Kirill A. Shutemov authored
      Add support for two ttu_flags:
      
        - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
          unmap page;
      
        - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;
      
      Also, change rwc->done to !page_mapcount() instead of !page_mapped().
      try_to_unmap() works on pte level, so we are really interested in the
      mappedness of this small page rather than of the compound page it's a
      part of.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a52bcbc
    • Mel Gorman's avatar
      mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman authored
      THP defrag is enabled by default to direct reclaim/compact but not wake
      kswapd in the event of a THP allocation failure.  The problem is that
      THP allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim on NUMA at least
      even though the working set size is 80% of RAM.  It's been years and
      it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows;
      
       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct/reclaim but not wake kswapd.
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really cares are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment as kswapd was
      never woken up by that path.
      
      This means that a THP fault will no longer stall for most applications
      by default and the ideal for most users that get THP if they are
      immediately available.  There are still options for users that prefer a
      stall at startup of a new application by either restoring historical
      behaviour with "always" or pick a half-way point with "defer" where
      kswapd does some of the work in the background and wakes kcompactd if
      necessary.  THP defrag for khugepaged remains enabled and will enter
      direct/reclaim but no wakeup kswapd or kcompactd.
      
      After this patch a THP allocation failure will quickly fallback and rely
      on khugepaged to recover the situation at some time in the future.  In
      some cases, this will reduce THP usage but the benefit of THP is hard to
      measure and not a universal win where as a stall to reclaim/compaction
      is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied with smaller benefits as the thread increases.  Similar,
      notice the large reduction in most cases in system CPU usage.  The
      overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures the latency of whether base
      or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of caseds but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct relcim activity and allocation stalls is substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity is eliminated as well
      as compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free so lets take the costs out of the fast paths finally and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      444eb2a4
  14. 03 Mar, 2016 1 commit
  15. 22 Jan, 2016 1 commit
  16. 16 Jan, 2016 11 commits
  17. 08 Sep, 2015 4 commits
  18. 13 Feb, 2015 2 commits
  19. 12 Feb, 2015 1 commit