    mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() · 7066f0f9
    Andrea Arcangeli authored
    After arming the numa/protnone pmd, change_huge_pmd() doesn't flush the
    TLB right away, so by the time do_huge_pmd_numa_page() runs some CPU could
    still access the page through a stale TLB entry.  That is why
    do_huge_pmd_numa_page() flushes the TLB before calling
    migrate_misplaced_transhuge_page().
    
    Before arming the numa/protnone transhuge pmd, change_huge_pmd() calls
    mmu_notifier_invalidate_range_start().  So there's no need for the
    mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
    sequence in migrate_misplaced_transhuge_page() either, because by the time
    migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
    invalidated in the secondary MMUs.  It has to be: if a secondary MMU
    could still write to the page, migrate_page_copy() would lose data.
    
    However an explicit mmu_notifier_invalidate_range() is needed before
    migrate_misplaced_transhuge_page() starts copying the data of the
    transhuge page or the below can happen for MMU notifier users sharing the
    primary MMU pagetables and only implementing ->invalidate_range:
    
    CPU0		CPU1		GPU sharing linux pagetables using
                                    only ->invalidate_range
    -----------	------------	---------
    				GPU secondary MMU writes to the page
    				mapped by the transhuge pmd
    change_pmd_range()
    mmu..._range_start()
    ->invalidate_range_start() noop
    change_huge_pmd()
    set_pmd_at(numa/protnone)
    pmd_unlock()
    		do_huge_pmd_numa_page()
    		CPU TLB flush globally (1)
    		CPU cannot write to page
    		migrate_misplaced_transhuge_page()
    				GPU writes to the page...
    		migrate_page_copy()
    				...GPU stops writing to the page
    CPU TLB flush (2)
    mmu..._range_end() (3)
    ->invalidate_range_end() noop
    ->invalidate_range()
    				GPU secondary MMU is invalidated
    				and cannot write to the page anymore
    				(too late)
    
    Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
    too late, we also need a mmu_notifier_invalidate_range() before calling
    migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
    (3) also arrives too late.
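
    The ordering problem above can be illustrated with a user-space toy
    model.  This is only a sketch under stated assumptions, not kernel code:
    every name in it (toy_migrate_keeps_gpu_write(), TOY_PAGE_SIZE,
    TOY_GPU_VALUE) is hypothetical, the "GPU" is a single store to a byte
    that has already been copied, and a faulting GPU store is assumed to be
    retried against the new page once the migration is done.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 8		/* toy size, not HPAGE_PMD_SIZE */
#define TOY_GPU_VALUE 0xAB	/* value the modeled GPU stores */

/*
 * Toy model of the race (hypothetical, user space only).
 * Returns true if a GPU store issued while the page is being copied is
 * still visible in the new page after the migration.
 * invalidate_before_copy models the explicit secondary MMU invalidate
 * that has to happen before the copy starts.
 */
bool toy_migrate_keeps_gpu_write(bool invalidate_before_copy)
{
	unsigned char old_page[TOY_PAGE_SIZE] = {0};
	unsigned char new_page[TOY_PAGE_SIZE] = {0};
	bool gpu_mapping_valid = true;	/* secondary MMU still maps old_page */
	bool gpu_store_pending = false;	/* GPU faulted and will retry later */
	size_t half = TOY_PAGE_SIZE / 2;

	/* (1) The CPU TLB flush happens here; CPUs are not modeled further. */

	/* The fix: invalidate the secondary MMU before copying any data. */
	if (invalidate_before_copy)
		gpu_mapping_valid = false;

	/* First half of the copy. */
	memcpy(new_page, old_page, half);

	/* The GPU stores to byte 0, which has already been copied. */
	if (gpu_mapping_valid)
		old_page[0] = TOY_GPU_VALUE;	/* lands in the old page */
	else
		gpu_store_pending = true;	/* GPU faults instead */

	/* Second half of the copy. */
	memcpy(new_page + half, old_page + half, TOY_PAGE_SIZE - half);

	/* (3) The late invalidate: too late to save a store made above. */
	gpu_mapping_valid = false;

	/* New mapping installed: a faulted store is retried on the new page. */
	if (gpu_store_pending)
		new_page[0] = TOY_GPU_VALUE;

	return new_page[0] == TOY_GPU_VALUE;
}
```

    In this model toy_migrate_keeps_gpu_write(false) reports the store as
    lost, because it lands in the old page after that byte was copied, while
    toy_migrate_keeps_gpu_write(true) keeps it.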
    
    This requirement is the result of the lazy optimization in
    change_huge_pmd() that releases the pmd_lock without first flushing the
    TLB and without first calling mmu_notifier_invalidate_range().
    
    Even converting the removed mmu_notifier_invalidate_range_only_end() into
    a mmu_notifier_invalidate_range_end() would not have been enough to fix
    this, because it would still run after migrate_page_copy().
    
    After the hugepage data copy is done, migrate_misplaced_transhuge_page()
    can proceed and call set_pmd_at() without flushing the TLB or any
    secondary MMU: the secondary MMU invalidate, just like the CPU TLB flush,
    has to happen before migrate_page_copy() is called, or it would be a bug
    in the first place (and it was for drivers using ->invalidate_range()).
    
    KVM is unaffected because it doesn't implement ->invalidate_range().
    
    The standard PAGE_SIZEd migrate_misplaced_page() is less accelerated and
    uses the generic migrate_pages(), which transitions the pte from
    numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
    and all mmu notifiers there before copying the page.
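
    The small-page ordering can be sketched as a tiny user-space state
    model.  It is an illustration only, not the kernel implementation, and
    the names toy_small_page, toy_unmap_one(), toy_safe_to_copy() and
    toy_small_page_copy_is_safe() are hypothetical: the copy is considered
    safe only once the pte is a migration entry and both the TLBs and the
    mmu notifiers have been flushed.

```c
#include <stdbool.h>

/* Hypothetical pte states for the toy model. */
enum toy_pte_state { TOY_PTE_PROTNONE, TOY_PTE_MIGRATION_ENTRY };

struct toy_small_page {
	enum toy_pte_state pte;
	bool tlb_flushed;		/* CPU TLBs flushed */
	bool notifiers_invalidated;	/* secondary MMUs invalidated */
};

/* Models the unmap step: install the migration entry and flush everything. */
void toy_unmap_one(struct toy_small_page *p)
{
	p->pte = TOY_PTE_MIGRATION_ENTRY;
	p->tlb_flushed = true;
	p->notifiers_invalidated = true;
}

/* The page data may be copied only when no MMU can still reach the page. */
bool toy_safe_to_copy(const struct toy_small_page *p)
{
	return p->pte == TOY_PTE_MIGRATION_ENTRY &&
	       p->tlb_flushed && p->notifiers_invalidated;
}

/* Returns whether the copy is safe, with or without the unmap step first. */
bool toy_small_page_copy_is_safe(bool unmap_first)
{
	struct toy_small_page p = { TOY_PTE_PROTNONE, false, false };

	if (unmap_first)
		toy_unmap_one(&p);
	return toy_safe_to_copy(&p);
}
```

    Because the unmap step does both flushes before the copy, this path has
    no window comparable to the transhuge one.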
    
    Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com
    
    
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>