Skip to content
  • Andrea Arcangeli's avatar
    userfaultfd: fix SIGBUS resulting from false rwsem wakeups · 15a77c6f
    Andrea Arcangeli authored
    With >=32 CPUs the userfaultfd selftest triggered a graceful but
    unexpected SIGBUS because VM_FAULT_RETRY was returned by
    handle_userfault() despite the UFFDIO_COPY wasn't completed.
    
    This seems caused by rwsem waking the thread blocked in
    handle_userfault() and we can't run up_read() before the wait_event
    sequence is complete.
    
    Keeping the wait_even sequence identical to the first one, would require
    running userfaultfd_must_wait() again to know if the loop should be
    repeated, and it would also require retaking the rwsem and revalidating
    the whole vma status.
    
    It seems simpler to wait the targeted wakeup so that if false wakeups
    materialize we still wait for our specific wakeup event, unless of
    course there are signals or the uffd was released.
    
    Debug code collecting the stack trace of the wakeup showed this:
    
      $ ./userfaultfd 100 99999
      nr_pages: 25600, nr_pages_per_cpu: 800
      bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
      bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
    
        save_stack_trace+0x2b/0x50
        try_to_wake_up+0x2a6/0x580
        wake_up_q+0x32/0x70
        rwsem_wake+0xe0/0x120
        call_rwsem_wake+0x1b/0x30
        up_write+0x3b/0x40
        vm_mmap_pgoff+0x9c/0xc0
        SyS_mmap_pgoff+0x1a9/0x240
        SyS_mmap+0x22/0x30
        entry_SYSCALL_64_fastpath+0x1f/0xbd
        0xffffffffffffffff
        FAULT_FLAG_ALLOW_RETRY missing 70
      CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
        dump_stack+0xb8/0x112
        handle_userfault+0x572/0x650
        handle_mm_fault+0x12cb/0x1520
        __do_page_fault+0x175/0x500
        trace_do_page_fault+0x61/0x270
        do_async_page_fault+0x19/0x90
        async_page_fault+0x25/0x30
    
    This always happens when the main userfault selftest thread is running
    clone() while glibc runs either mprotect or mmap (both taking mmap_sem
    down_write()) to allocate the thread stack of the background threads,
    while locking/userfault threads already run at full throttle and are
    susceptible to false wakeups that may cause handle_userfault() to return
    before than expected (which results in graceful SIGBUS at the next
    attempt).
    
    This was reproduced only with >=32 CPUs because the loop to start the
    thread where clone() is too quick with fewer CPUs, while with 32 CPUs
    there's already significant activity on ~32 locking and userfault
    threads when the last background threads are started with clone().
    
    This >=32 CPUs SMP race condition is likely reproducible only with the
    selftest because of the much heavier userfault load it generates if
    compared to real apps.
    
    We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
    patch floating around that provides it also hidden this problem but in
    reality only is successfully at hiding the problem.
    
    False wakeups could still happen again the second time
    handle_userfault() is invoked, even if it's a so rare race condition
    that getting false wakeups twice in a row is impossible to reproduce.
    This full fix is needed for correctness, the only alternative would be
    to allow VM_FAULT_RETRY to be returned infinitely.  With this fix the WP
    support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
    of returning it infinite times to avoid the SIGBUS).
    
    Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
    
    
    Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
    Reported-by: default avatarShubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
    Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
    Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
    Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
    Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
    Cc: Pavel Emelyanov <xemul@parallels.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    15a77c6f