1. 26 Oct, 2018 4 commits
    • Andrea Arcangeli's avatar
      mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() · 7eef5f97
      Andrea Arcangeli authored
      There should be no cache left by the time we overwrite the old transhuge
      pmd with the new one.  It's already too late to flush through the virtual
      address because we already copied the page data to the new physical
      address.
      
      So flush the cache before the data copy.
      
      Also delete the "end" variable to shutoff a "unused variable" warning on
      x86 where flush_cache_range() is a noop.
      
      Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7eef5f97
    • Andrea Arcangeli's avatar
      mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() · 7066f0f9
      Andrea Arcangeli authored
      change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
      right away.  do_huge_pmd_numa_page() flushes the TLB before calling
      migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
      runs some CPU could still access the page through the TLB.
      
      change_huge_pmd() before arming the numa/protnone transhuge pmd calls
      mmu_notifier_invalidate_range_start().  So there's no need of
      mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
      sequence in migrate_misplaced_transhuge_page() too, because by the time
      migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
      invalidated in the secondary MMUs.  It has to or if a secondary MMU can
      still write to the page, the migrate_page_copy() would lose data.
      
      However an explicit mmu_notifier_invalidate_range() is needed before
      migrate_misplaced_transhuge_page() starts copying the data of the
      transhuge page or the below can happen for MMU notifier users sharing the
      primary MMU pagetables and only implementing ->invalidate_range:
      
      CPU0		CPU1		GPU sharing linux pagetables using
                                      only ->invalidate_range
      -----------	------------	---------
      				GPU secondary MMU writes to the page
      				mapped by the transhuge pmd
      change_pmd_range()
      mmu..._range_start()
      ->invalidate_range_start() noop
      change_huge_pmd()
      set_pmd_at(numa/protnone)
      pmd_unlock()
      		do_huge_pmd_numa_page()
      		CPU TLB flush globally (1)
      		CPU cannot write to page
      		migrate_misplaced_transhuge_page()
      				GPU writes to the page...
      		migrate_page_copy()
      				...GPU stops writing to the page
      CPU TLB flush (2)
      mmu..._range_end() (3)
      ->invalidate_range_stop() noop
      ->invalidate_range()
      				GPU secondary MMU is invalidated
      				and cannot write to the page anymore
      				(too late)
      
      Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
      too late, we also need a mmu_notifier_invalidate_range() before calling
      migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
      (3) also arrives too late.
      
      This requirement is the result of the lazy optimization in
      change_huge_pmd() that releases the pmd_lock without first flushing the
      TLB and without first calling mmu_notifier_invalidate_range().
      
      Even converting the removed mmu_notifier_invalidate_range_only_end() into
      a mmu_notifier_invalidate_range_end() would not have been enough to fix
      this, because it run after migrate_page_copy().
      
      After the hugepage data copy is done migrate_misplaced_transhuge_page()
      can proceed and call set_pmd_at without having to flush the TLB nor any
      secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
      flush, has to happen before the migrate_page_copy() is called or it would
      be a bug in the first place (and it was for drivers using
      ->invalidate_range()).
      
      KVM is unaffected because it doesn't implement ->invalidate_range().
      
      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
      uses the generic migrate_pages which transitions the pte from
      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
      and all mmu notifiers there before copying the page.
      
      Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarAaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7066f0f9
    • Andrea Arcangeli's avatar
      mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition · d7c33934
      Andrea Arcangeli authored
      Patch series "migrate_misplaced_transhuge_page race conditions".
      
      Aaron found a new instance of the THP MADV_DONTNEED race against
      pmdp_clear_flush* variants, that was apparently left unfixed.
      
      While looking into the race found by Aaron, I may have found two more
      issues in migrate_misplaced_transhuge_page.
      
      These race conditions would not cause kernel instability, but they'd
      corrupt userland data or leave data non zero after MADV_DONTNEED.
      
      I did only minor testing, and I don't expect to be able to reproduce this
      (especially the lack of ->invalidate_range before migrate_page_copy,
      requires the latest iommu hardware or infiniband to reproduce).  The last
      patch is noop for x86 and it needs further review from maintainers of
      archs that implement flush_cache_range() (not in CC yet).
      
      To avoid confusion, it's not the first patch that introduces the bug fixed
      in the second patch, even before removing the
      pmdp_huge_clear_flush_notify, that _notify suffix was called after
      migrate_page_copy already run.
      
      This patch (of 3):
      
      This is a corollary of ced10803 ("thp: fix MADV_DONTNEED vs.  numa
      balancing race"), 58ceeb6b ("thp: fix MADV_DONTNEED vs.  MADV_FREE
      race") and 5b7abeae ("thp: fix MADV_DONTNEED vs clear soft dirty
      race).
      
      When the above three fixes where posted Dave asked
      https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
      but apparently this was missed.
      
      The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
      in a54a407f ("mm: Close races between THP migration and PMD numa
      clearing").
      
      The important part of such commit is only the part where the page lock is
      not released until the first do_huge_pmd_numa_page() finished disarming
      the pagenuma/protnone.
      
      The addition of pmdp_clear_flush() wasn't beneficial to such commit and
      there's no commentary about such an addition either.
      
      I guess the pmdp_clear_flush() in such commit was added just in case for
      safety, but it ended up introducing the MADV_DONTNEED race condition found
      by Aaron.
      
      At that point in time nobody thought of such kind of MADV_DONTNEED race
      conditions yet (they were fixed later) so the code may have looked more
      robust by adding the pmdp_clear_flush().
      
      This specific race condition won't destabilize the kernel, but it can
      confuse userland because after MADV_DONTNEED the memory won't be zeroed
      out.
      
      This also optimizes the code and removes a superfluous TLB flush.
      
      [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
      Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarAaron Tomlin <atomlin@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7c33934
    • Johannes Weiner's avatar
      mm: workingset: tell cache transitions from workingset thrashing · 1899ad18
      Johannes Weiner authored
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bonafide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: default avatarDaniel Drake <drake@endlessm.com>
      Tested-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1899ad18
  2. 21 Oct, 2018 1 commit
  3. 05 Oct, 2018 2 commits
  4. 02 Oct, 2018 2 commits
    • Mel Gorman's avatar
      mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Mel Gorman authored
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or not performance difference
      implying that there is not a heavily reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      efaffc5e
    • Srikar Dronamraju's avatar
      mm/migrate: Use spin_trylock() while resetting rate limit · 75346121
      Srikar Dronamraju authored
      Since this spinlock will only serialize the migrate rate limiting,
      convert the spin_lock() to a spin_trylock(). If another thread is updating, this
      task can move on.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     205332  198512   -3.32145
      1     319785  313559   -1.94693
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      8     74912   74761.9  -0.200368
      1     206585  214874   4.01239
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     189162  180536   -4.56011
      1     213760  210281   -1.62753
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     58736.8  56511.4  -3.78877
      1     105419   104899   -0.49327
      
      Avoiding stretching of window intervals may be the reason for the
      regression. Also code now uses READ_ONCE/WRITE_ONCE. That may
      also be hurting performance to some extent.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        14,285,708      13,818,546
      migrations                1,180,621       1,149,960
      faults                    339,114         385,583
      cache-misses              55,205,631,894  55,259,546,768
      sched:sched_move_numa     843             2,257
      sched:sched_stick_numa    6               9
      sched:sched_swap_numa     219             512
      migrate:mm_migrate_pages  365             2,225
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        26907   72692
      numa_hint_faults_local  24279   62270
      numa_hit                239771  238762
      numa_huge_pte_updates   0       48
      numa_interleave         68      75
      numa_local              239688  238676
      numa_other              83      86
      numa_pages_migrated     363     2225
      numa_pte_updates        27415   98557
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,202,779       3,173,490
      migrations                37,186          36,966
      faults                    106,076         108,776
      cache-misses              12,024,873,744  12,200,075,320
      sched:sched_move_numa     931             1,264
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     1               0
      migrate:mm_migrate_pages  637             899
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        17409   21109
      numa_hint_faults_local  14367   17120
      numa_hit                73953   72934
      numa_huge_pte_updates   20      42
      numa_interleave         25      33
      numa_local              73892   72866
      numa_other              61      68
      numa_pages_migrated     668     915
      numa_pte_updates        27276   42326
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,474,013    8,312,022
      migrations                254,934      231,705
      faults                    320,506      310,242
      cache-misses              110,580,458  402,324,573
      sched:sched_move_numa     725          193
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     7            3
      migrate:mm_migrate_pages  145          93
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        22797   11838
      numa_hint_faults_local  21539   11216
      numa_hit                89308   90689
      numa_huge_pte_updates   0       0
      numa_interleave         865     1579
      numa_local              88955   89634
      numa_other              353     1055
      numa_pages_migrated     149     92
      numa_pte_updates        22930   12109
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,195,628  2,170,481
      migrations                11,179     10,126
      faults                    149,656    160,962
      cache-misses              8,117,515  10,834,845
      sched:sched_move_numa     49         10
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  5          2
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3577    403
      numa_hint_faults_local  3476    358
      numa_hit                26142   25898
      numa_huge_pte_updates   0       0
      numa_interleave         358     207
      numa_local              26042   25860
      numa_other              100     38
      numa_pages_migrated     5       2
      numa_pte_updates        3587    400
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        100,602,296      110,339,633
      migrations                4,135,630        4,139,812
      faults                    789,256          863,622
      cache-misses              226,160,621,058  231,838,045,660
      sched:sched_move_numa     1,366            2,196
      sched:sched_stick_numa    16               33
      sched:sched_swap_numa     374              544
      migrate:mm_migrate_pages  1,350            2,469
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        47857   85748
      numa_hint_faults_local  39768   66831
      numa_hit                240165  242213
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240165  242211
      numa_other              0       2
      numa_pages_migrated     1224    2376
      numa_pte_updates        48354   86233
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        58,515,496      59,331,057
      migrations                564,845         552,019
      faults                    245,807         266,586
      cache-misses              73,603,757,976  73,796,312,990
      sched:sched_move_numa     996             981
      sched:sched_stick_numa    10              54
      sched:sched_swap_numa     193             286
      migrate:mm_migrate_pages  646             713
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        13422   14807
      numa_hint_faults_local  5619    5738
      numa_hit                36118   36230
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36116   36228
      numa_other              2       2
      numa_pages_migrated     616     703
      numa_pte_updates        13374   14742
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-6-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      75346121
  5. 24 Aug, 2018 2 commits
    • Naoya Horiguchi's avatar
      mm: soft-offline: close the race against page allocation · d4ae9916
      Naoya Horiguchi authored
      A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
      allocate a page that was just freed on the way of soft-offline.  This is
      undesirable because soft-offline (which is about corrected error) is
      less aggressive than hard-offline (which is about uncorrected error),
      and we can make soft-offline fail and keep using the page for good
      reason like "system is busy."
      
      Two main changes of this patch are:
      
      - setting migrate type of the target page to MIGRATE_ISOLATE. As done
        in free_unref_page_commit(), this makes kernel bypass pcplist when
        freeing the page. So we can assume that the page is in freelist just
        after put_page() returns,
      
      - setting PG_hwpoison on free page under zone->lock which protects
        freelists, so this allows us to avoid setting PG_hwpoison on a page
        that is decided to be allocated soon.
      
      [akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
      Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarXishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zy.zhengyi@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d4ae9916
    • Naoya Horiguchi's avatar
      mm: fix race on soft-offlining free huge pages · 6bc9b564
      Naoya Horiguchi authored
      Patch series "mm: soft-offline: fix race against page allocation".
      
      Xishi recently reported the issue about race on reusing the target pages
      of soft offlining.  Discussion and analysis showed that we need make
      sure that setting PG_hwpoison should be done in the right place under
      zone->lock for soft offline.  1/2 handles free hugepage's case, and 2/2
      hanldes free buddy page's case.
      
      This patch (of 2):
      
      There's a race condition between soft offline and hugetlb_fault which
      causes unexpected process killing and/or hugetlb allocation failure.
      
      The process killing is caused by the following flow:
      
        CPU 0               CPU 1              CPU 2
      
        soft offline
          get_any_page
          // find the hugetlb is free
                            mmap a hugetlb file
                            page fault
                              ...
                                hugetlb_fault
                                  hugetlb_no_page
                                    alloc_huge_page
                                    // succeed
            soft_offline_free_page
            // set hwpoison flag
                                               mmap the hugetlb file
                                               page fault
                                                 ...
                                                   hugetlb_fault
                                                     hugetlb_no_page
                                                       find_lock_page
                                                         return VM_FAULT_HWPOISON
                                                 mm_fault_error
                                                   do_sigbus
                                                   // kill the process
      
      The hugetlb allocation failure comes from the following flow:
      
        CPU 0                          CPU 1
      
                                       mmap a hugetlb file
                                       // reserve all free page but don't fault-in
        soft offline
          get_any_page
          // find the hugetlb is free
            soft_offline_free_page
            // set hwpoison flag
              dissolve_free_huge_page
              // fail because all free hugepages are reserved
                                       page fault
                                         ...
                                           hugetlb_fault
                                             hugetlb_no_page
                                               alloc_huge_page
                                                 ...
                                                   dequeue_huge_page_node_exact
                                                   // ignore hwpoisoned hugepage
                                                   // and finally fail due to no-mem
      
      The root cause of this is that current soft-offline code is written based
      on an assumption that PageHWPoison flag should be set at first to avoid
      accessing the corrupted data.  This makes sense for memory_failure() or
      hard offline, but does not for soft offline because soft offline is about
      corrected (not uncorrected) error and is safe from data lost.  This patch
      changes soft offline semantics where it sets PageHWPoison flag only after
      containment of the error page completes successfully.
      
      Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarXishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Suggested-by: default avatarXishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zy.zhengyi@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6bc9b564
  6. 23 Aug, 2018 1 commit
  7. 17 Aug, 2018 1 commit
  8. 12 May, 2018 1 commit
  9. 21 Apr, 2018 2 commits
    • Naoya Horiguchi's avatar
      mm: enable thp migration for shmem thp · e71769ae
      Naoya Horiguchi authored
      My testing for the latest kernel supporting thp migration showed an
      infinite loop in offlining the memory block that is filled with shmem
      thps.  We can get out of the loop with a signal, but kernel should return
      with failure in this case.
      
      What happens in the loop is that scan_movable_pages() repeats returning
      the same pfn without any progress.  That's because page migration always
      fails for shmem thps.
      
      In memory offline code, memory blocks containing unmovable pages should be
      prevented from being offline targets by has_unmovable_pages() inside
      start_isolate_page_range().  So it's possible to change migratability for
      non-anonymous thps to avoid the issue, but it introduces more complex and
      thp-specific handling in migration code, so it might not good.
      
      So this patch is suggesting to fix the issue by enabling thp migration for
      shmem thp.  Both of anon/shmem thp are migratable so we don't need
      precheck about the type of thps.
      
      Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
      Fixes: commit 72b39cfc ("mm, memory_hotplug: do not fail offlining too early")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <zi.yan@sent.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e71769ae
    • Michal Hocko's avatar
      mm: fix do_pages_move status handling · 8f175cf5
      Michal Hocko authored
      Li Wang has reported that LTP move_pages04 test fails with the current
      tree:
      
      LTP move_pages04:
         TFAIL  :  move_pages04.c:143: status[1] is EPERM, expected EFAULT
      
      The test allocates an array of two pages, one is present while the other
      is not (resp.  backed by zero page) and it expects EFAULT for the second
      page as the man page suggests.  We are reporting EPERM which doesn't make
      any sense and this is a result of a bug from cf5f16b23ec9 ("mm: unclutter
      THP migration").
      
      do_pages_move tries to handle as many pages in one batch as possible so we
      queue all pages with the same node target together and that corresponds to
      [start, i] range which is then used to update status array.
      add_page_for_migration will correctly notice the zero (resp.  !present)
      page and returns with EFAULT which gets written to the status.  But if
      this is the last page in the array we do not update start and so the last
      store_status after the loop will overwrite the range of the last batch
      with NUMA_NO_NODE (which corresponds to EPERM).
      
      Fix this by simply bailing out from the last flush if the pagelist is
      empty as there is clearly nothing more to do.
      
      Link: http://lkml.kernel.org/r/20180418121255.334-1-mhocko@kernel.org
      Fixes: cf5f16b23ec9 ("mm: unclutter THP migration")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarLi Wang <liwang@redhat.com>
      Tested-by: default avatarLi Wang <liwang@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f175cf5
  10. 11 Apr, 2018 6 commits
    • Matthew Wilcox's avatar
      page cache: use xa_lock · b93b0163
      Matthew Wilcox authored
      Remove the address_space ->tree_lock and use the xa_lock newly added to
      the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
      since we don't really care that it's a tree.
      
      [willy@infradead.org: fix nds32, fs/dax.c]
        Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: default avatarJeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b93b0163
    • Michal Hocko's avatar
      mm: unclutter THP migration · 94723aaf
      Michal Hocko authored
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by spliting the THP into small pages while moving the head
      page to the newly allocated order-0 page.  Remaning pages are moved to
      the LRU list by split_huge_page.  The same happens if the THP allocation
      fails.  This is really ugly and error prone [1].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      This patch tries to unclutter the situation by moving the special THP
      handling up to the migrate_pages layer where it actually belongs.  We
      simply split the THP page into the existing list if unmap_and_move fails
      with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
      specific migrate_pages users do not have to deal with this case in a
      special way.
      
      [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94723aaf
    • Michal Hocko's avatar
      mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko authored
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey node_id resp.  migration error up
      to move_pages code (do_move_page_to_node_array).  The error status never
      made it into the final status field and we have a better way to
      communicate node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can drop it finally.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      666feb21
    • Michal Hocko's avatar
      mm, numa: rework do_pages_move · a49bd4d7
      Michal Hocko authored
      Patch series "unclutter thp migration"
      
      Motivation:
      
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  Remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [2].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      The first patch reworks do_pages_move which relies on a very ugly
      calling semantic when the return status is pushed to the migration path
      via private pointer.  It uses pre allocated fixed size batching to
      achieve that.  We simply cannot do the same if a THP is to be split
      during the migration path which is done in the patch 3.  Patch 2 is
      follow up cleanup which removes the mentioned return status calling
      convention ugliness.
      
      On a side note:
      
      There are some semantic issues I have encountered on the way when
      working on patch 1 but I am not addressing them here.  E.g. trying to
      move THP tail pages will result in either success or EBUSY (the later
      one more likely once we isolate head from the LRU list).  Hugetlb
      reports EACCESS on tail pages.  Some errors are reported via status
      parameter but migration failures are not even though the original
      `reason' argument suggests there was an intention to do so.  From a
      quick look into git history this never worked.  I have tried to keep the
      semantic unchanged.
      
      Then there is a relatively minor thing that the page isolation might
      fail because of pages not being on the LRU - e.g. because they are
      sitting on the per-cpu LRU caches.  Easily fixable.
      
      This patch (of 3):
      
      do_pages_move is supposed to move user defined memory (an array of
      addresses) to the user defined numa nodes (an array of nodes one for
      each address).  The user provided status array then contains resulting
      numa node for each address or an error.  The semantic of this function
      is little bit confusing because only some errors are reported back.
      Notably migrate_pages error is only reported via the return value.  This
      patch doesn't try to address these semantic nuances but rather change
      the underlying implementation.
      
      Currently we are processing user input (which can be really large) in
      batches which are stored to a temporarily allocated page.  Each address
      is resolved to its struct page and stored to page_to_node structure
      along with the requested target numa node.  The array of these
      structures is then conveyed down the page migration path via private
      argument.  new_page_node then finds the corresponding structure and
      allocates the proper target page.
      
      What is the problem with the current implementation and why to change
      it? Apart from being quite ugly it also doesn't cope with unexpected
      pages showing up on the migration list inside migrate_pages path.  That
      doesn't happen currently but the follow up patch would like to make the
      thp migration code more clear and that would need to split a THP into
      the list for some cases.
      
      How does the new implementation work? Well, instead of batching into a
      fixed size array we simply batch all pages that should be migrated to
      the same node and isolate all of them into a linked list which doesn't
      require any additional storage.  This should work reasonably well
      because page migration usually migrates larger ranges of memory to a
      specific node.  So the common case should work equally well as the
      current implementation.  Even if somebody constructs an input where the
      target numa nodes would be interleaved we shouldn't see a large
      performance impact because page migration alone doesn't really benefit
      from batching.  mmap_sem batching for the lookup is quite questionable
      and isolate_lru_page which would benefit from batching is not using it
      even in the current implementation.
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a49bd4d7
    • Ralph Campbell's avatar
      mm/migrate: properly preserve write attribute in special migrate entry · 07707125
      Ralph Campbell authored
      Use of pte_write(pte) is only valid for present pte, the common code
      which set the migration entry can be reach for both valid present pte
      and special swap entry (for device memory).  Fix the code to use the
      mpfn value which properly handle both cases.
      
      On x86 this did not have any bad side effect because pte write bit is
      below PAGE_BIT_GLOBAL and thus special swap entry have it set to 0 which
      in turn means we were always creating read only special migration entry.
      
      So once migration did finish we always write protected the CPU page
      table entry (moreover this is only an issue when migrating from device
      memory to system memory).  End effect is that CPU write access would
      fault again and restore write permission.
      
      This behaviour isn't too bad; it just burns CPU cycles by forcing CPU to
      take a second fault on write access. ie, double faulting the same
      address.  There is no corruption or incorrect states (it behaves as a
      COWed page from a fork with a mapcount of 1).
      
      Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.comSigned-off-by: default avatarRalph Campbell <rcampbell@nvidia.com>
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      07707125
    • Mel Gorman's avatar
      sched/numa: avoid trapping faults and attempting migration of file-backed dirty pages · 09a913a7
      Mel Gorman authored
      change_pte_range is called from task work context to mark PTEs for
      receiving NUMA faulting hints.  If the marked pages are dirty then
      migration may fail.  Some filesystems cannot migrate dirty pages without
      blocking so are skipped in MIGRATE_ASYNC mode which just wastes CPU.
      Even when they can, it can be a waste of cycles when the pages are
      shared forcing higher scan rates.  This patch avoids marking shared
      dirty pages for hinting faults but also will skip a migration if the
      page was dirtied after the scanner updated a clean page.
      
      This is most noticeable running the NASA Parallel Benchmark when backed
      by btrfs, the default root filesystem for some distributions, but also
      noticeable when using XFS.
      
      The following are results from a 4-socket machine running a 4.16-rc4
      kernel with some scheduler patches that are pending for the next merge
      window.
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1
        Time cg.D      459.07 (   0.00%)      444.21 (   3.24%)
        Time ep.D       76.96 (   0.00%)       77.69 (  -0.95%)
        Time is.D       25.55 (   0.00%)       27.85 (  -9.00%)
        Time lu.D      601.58 (   0.00%)      596.87 (   0.78%)
        Time mg.D      107.73 (   0.00%)      108.22 (  -0.45%)
      
      is.D regresses slightly in terms of absolute time but note that that
      particular load varies quite a bit from run to run.  The more relevant
      observation is the total system CPU usage.
      
                  4.16.0-rc4  4.16.0-rc4
                schedtip-20180309 nodirty-v1
        User        71471.91    70627.04
        System      11078.96     8256.13
        Elapsed       661.66      632.74
      
      That is a substantial drop in system CPU usage and overall the workload
      completes faster.  The NUMA balancing statistics are also interesting
      
        NUMA base PTE updates        111407972   139848884
        NUMA huge PMD updates           206506      264869
        NUMA page range updates      217139044   275461812
        NUMA hint faults               4300924     3719784
        NUMA hint local faults         3012539     3416618
        NUMA hint local percent             70          91
        NUMA pages migrated            1517487     1358420
      
      While more PTEs are scanned due to changes in what faults are gathered,
      it's clear that a far higher percentage of faults are local as the bulk
      of the remote hits were dirty pages that, in this case with btrfs, had
      no chance of migrating.
      
      The following is a comparison when using XFS as that is a more realistic
      filesystem choice for a data partition
      
                              4.16.0-rc4             4.16.0-rc4
                       schedtip-20180309          nodirty-v1r47
        Time cg.D      485.28 (   0.00%)      442.62 (   8.79%)
        Time ep.D       77.68 (   0.00%)       77.54 (   0.18%)
        Time is.D       26.44 (   0.00%)       24.79 (   6.24%)
        Time lu.D      597.46 (   0.00%)      597.11 (   0.06%)
        Time mg.D      142.65 (   0.00%)      105.83 (  25.81%)
      
      That is a reasonable gain on two relatively long-lived workloads.  While
      not presented, there is also a substantial drop in system CPu usage and
      the NUMA balancing stats show similar improvements in locality as btrfs
      did.
      
      Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09a913a7
  11. 02 Apr, 2018 1 commit
  12. 01 Feb, 2018 1 commit
    • Michal Hocko's avatar
      mm, hugetlb: do not rely on overcommit limit during migration · ab5ac90a
      Michal Hocko authored
      hugepage migration relies on __alloc_buddy_huge_page to get a new page.
      This has 2 main disadvantages.
      
      1) it doesn't allow to migrate any huge page if the pool is used
         completely which is not an exceptional case as the pool is static and
         unused memory is just wasted.
      
      2) it leads to a weird semantic when migration between two numa nodes
         might increase the pool size of the destination NUMA node while the
         page is in use.  The issue is caused by per NUMA node surplus pages
         tracking (see free_huge_page).
      
      Address both issues by changing the way how we allocate and account
      pages allocated for migration.  Those should temporal by definition.  So
      we mark them that way (we will abuse page flags in the 3rd page) and
      update free_huge_page to free such pages to the page allocator.  Page
      migration path then just transfers the temporal status from the new page
      to the old one which will be freed on the last reference.  The global
      surplus count will never change during this path but we still have to be
      careful when migrating a per-node suprlus page.  This is now handled in
      move_hugetlb_state which is called from the migration path and it copies
      the hugetlb specific page state and fixes up the accounting when needed
      
      Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
      reflect its purpose.  The new allocation routine for the migration path
      is __alloc_migrate_huge_page.
      
      The user visible effect of this patch is that migrated pages are really
      temporal and they travel between NUMA nodes as per the migration
      request:
      
      Before migration
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      After
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      with the previous implementation, both nodes would have nr_hugepages:1
      until the page is freed.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab5ac90a
  13. 29 Nov, 2017 1 commit
  14. 27 Nov, 2017 1 commit
    • Kirill A. Shutemov's avatar
      mm, thp: Do not make pmd/pud dirty without a reason · 152e93af
      Kirill A. Shutemov authored
      Currently we make page table entries dirty all the time regardless of
      access type and don't even consider if the mapping is write-protected.
      The reasoning is that we don't really need dirty tracking on THP and
      making the entry dirty upfront may save some time on first write to the
      page.
      
      Unfortunately, such approach may result in false-positive
      can_follow_write_pmd() for huge zero page or read-only shmem file.
      
      Let's only make page dirty only if we about to write to the page anyway
      (as we do for small pages).
      
      I've restructured the code to make entry dirty inside
      maybe_p[mu]d_mkwrite(). It also takes into account if the vma is
      write-protected.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      152e93af
  15. 16 Nov, 2017 1 commit
    • Jérôme Glisse's avatar
      mm/mmu_notifier: avoid call to invalidate_range() in range_end() · 4645b9fe
      Jérôme Glisse authored
      This is an optimization patch that only affect mmu_notifier users which
      rely on the invalidate_range() callback.  This patch avoids calling that
      callback twice in a row from inside __mmu_notifier_invalidate_range_end
      
      Existing pattern (before this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_end()
              mmu_notifier_invalidate_range()
      
      New pattern (after this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_only_end()
      
      We call the invalidate_range callback after clearing the page table
      under the page table lock and we skip the call to invalidate_range
      inside the __mmu_notifier_invalidate_range_end() function.
      
      Idea from Andrea Arcangeli
      
      Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4645b9fe
  16. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  17. 13 Oct, 2017 1 commit
  18. 09 Sep, 2017 9 commits
    • Jérôme Glisse's avatar
      mm/hmm: avoid bloating arch that do not make use of HMM · 6b368cd4
      Jérôme Glisse authored
      This moves all new code including new page migration helper behind kernel
      Kconfig option so that there is no codee bloat for arch or user that do
      not want to use HMM or any of its associated features.
      
      arm allyesconfig (without all the patchset, then with and this patch):
         text	   data	    bss	    dec	    hex	filename
      83721896	46511131	27582964	157815991	96814b7	../without/vmlinux
      83722364	46511131	27582964	157816459	968168b	vmlinux
      
      [jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
        Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
      [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
        Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b368cd4
    • Jérôme Glisse's avatar
      mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse authored
      Platform with advance system bus (like CAPI or CCIX) allow device memory
      to be accessible from CPU in a cache coherent fashion.  Add a new type of
      ZONE_DEVICE to represent such memory.  The use case are the same as for
      the un-addressable device memory but without all the corners cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df6ad698
    • Jérôme Glisse's avatar
      mm/migrate: allow migrate_vma() to alloc new page on empty entry · 8315ada7
      Jérôme Glisse authored
      This allows callers of migrate_vma() to allocate new page for empty CPU
      page table entry (pte_none or back by zero page).  This is only for
      anonymous memory and it won't allow new page to be instanced if the
      userfaultfd is armed.
      
      This is useful to device driver that want to migrate a range of virtual
      address and would rather allocate new memory than having to fault later
      on.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8315ada7
    • Jérôme Glisse's avatar
      mm/migrate: support un-addressable ZONE_DEVICE page in migration · a5430dda
      Jérôme Glisse authored
      Allow to unmap and restore special swap entry of un-addressable
      ZONE_DEVICE memory.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5430dda
    • Jérôme Glisse's avatar
      mm/migrate: migrate_vma() unmap page from vma while collecting pages · 8c3328f1
      Jérôme Glisse authored
      Common case for migration of virtual address range is page are map only
      once inside the vma in which migration is taking place.  Because we
      already walk the CPU page table for that range we can directly do the
      unmap there and setup special migration swap entry.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c3328f1
    • Jérôme Glisse's avatar
      mm/migrate: new memory migration helper for use with device memory · 8763cb45
      Jérôme Glisse authored
      This patch add a new memory migration helpers, which migrate memory
      backing a range of virtual address of a process to different memory (which
      can be allocated through special allocator).  It differs from numa
      migration by working on a range of virtual address and thus by doing
      migration in chunk that can be large enough to use DMA engine or special
      copy offloading engine.
      
      Expected users are any one with heterogeneous memory where different
      memory have different characteristics (latency, bandwidth, ...).  As an
      example IBM platform with CAPI bus can make use of this feature to migrate
      between regular memory and CAPI device memory.  New CPU architecture with
      a pool of high performance memory not manage as cache but presented as
      regular memory (while being faster and with lower latency than DDR) will
      also be prime user of this patch.
      
      Migration to private device memory will be useful for device that have
      large pool of such like GPU, NVidia plans to use HMM for that.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: default avatarMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: default avatarSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: default avatarSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8763cb45
    • Jérôme Glisse's avatar
      mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Jérôme Glisse authored
      Introduce a new migration mode that allow to offload the copy to a device
      DMA engine.  This changes the workflow of migration and not all
      address_space migratepage callback can support this.
      
      This is intended to be use by migrate_vma() which itself is use for thing
      like HMM (see include/linux/hmm.h).
      
      No additional per-filesystem migratepage testing is needed.  I disables
      MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
      added comment in those to explain why (part of this patch).  The commit
      message is unclear it should say that any callback that wish to support
      this new mode need to be aware of the difference in the migration flow
      from other mode.
      
      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...) and for DMA to be effective you want to copy multiple
      pages in one DMA operations.  But in the problematic case you can not
      easily hold the extra lock accross multiple call to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
      
      To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
      new callback migratepages() (for instance) that deals with multiple
      pages in one transaction.
      
      Because the problematic cases are not important for current usage I did
      not wanted to complexify this patchset even more for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.comSigned-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2916ecc0
    • Naoya Horiguchi's avatar
      mm: migrate: move_pages() supports thp migration · e8db67eb
      Naoya Horiguchi authored
      This patch enables thp migration for move_pages(2).
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-10-zi.yan@sent.comSigned-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarZi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8db67eb
    • Zi Yan's avatar
      mm: thp: enable thp migration in generic path · 616b8371
      Zi Yan authored
      Add thp migration's core code, including conversions between a PMD entry
      and a swap entry, setting PMD migration entry, removing PMD migration
      entry, and waiting on PMD migration entries.
      
      This patch makes it possible to support thp migration.  If you fail to
      allocate a destination page as a thp, you just split the source thp as
      we do now, and then enter the normal page migration.  If you succeed to
      allocate destination thp, you enter thp migration.  Subsequent patches
      actually enable thp migration for each caller of page migration by
      allowing its get_new_page() callback to allocate thps.
      
      [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
        Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
      [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
      Signed-off-by: default avatarZi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      616b8371
  19. 20 Aug, 2017 1 commit
    • Linus Torvalds's avatar
      Sanitize 'move_pages()' permission checks · 197e7e52
      Linus Torvalds authored
      The 'move_paghes()' system call was introduced long long ago with the
      same permission checks as for sending a signal (except using
      CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
      
      That turns out to not be a great choice - while the system call really
      only moves physical page allocations around (and you need other
      capabilities to do a lot of it), you can check the return value to map
      out some the virtual address choices and defeat ASLR of a binary that
      still shares your uid.
      
      So change the access checks to the more common 'ptrace_may_access()'
      model instead.
      
      This tightens the access checks for the uid, and also effectively
      changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
      anybody really _uses_ this legacy system call any more (we hav ebetter
      NUMA placement models these days), so I expect nobody to notice.
      
      Famous last words.
      Reported-by: default avatarOtto Ebeling <otto.ebeling@iki.fi>
      Acked-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      197e7e52
  20. 10 Aug, 2017 1 commit
    • Nadav Amit's avatar
      Revert "mm: numa: defer TLB flush for THP migration as long as possible" · a9b80250
      Nadav Amit authored
      While deferring TLB flushes is a good practice, the reverted patch
      caused pending TLB flushes to be checked while the page-table lock is
      not taken.  As a result, in architectures with weak memory model (PPC),
      Linux may miss a memory-barrier, miss the fact TLB flushes are pending,
      and cause (in theory) a memory corruption.
      
      Since the alternative of using smp_mb__after_unlock_lock() was
      considered a bit open-coded, and the performance impact is expected to
      be small, the previous patch is reverted.
      
      This reverts b0943d61 ("mm: numa: defer TLB flush for THP migration
      as long as possible").
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.comSigned-off-by: default avatarNadav Amit <namit@vmware.com>
      Suggested-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9b80250