    dax: fix deadlock due to misaligned PMD faults · fffa281b
    Ross Zwisler authored
    In DAX there are two separate places where the 2MiB range of a PMD is
    defined.
    
    The first is in the page tables, where a PMD mapping inserted for a
    given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
    PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary at or below
    the address to the last byte before the next 2MiB boundary.
    
    So, for example, a fault at address 3MiB (0x30 0000) falls within the
    PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
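
As a rough user-space illustration of the page table range, the sketch below reproduces the mask arithmetic.  The PMD_SHIFT value and the helper names pmd_first_byte() and pmd_last_byte() are assumptions made for this example (2MiB PMDs as on x86-64), not kernel code.

```c
#include <stdint.h>

/* Assumed values for 2MiB PMDs; the kernel defines these per architecture. */
#define PMD_SHIFT	21
#define PMD_SIZE	(UINT64_C(1) << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE - 1))

/* First byte covered by the PMD that maps 'address' (hypothetical helper). */
static inline uint64_t pmd_first_byte(uint64_t address)
{
	return address & PMD_MASK;
}

/* Last byte covered by the PMD that maps 'address', inclusive (hypothetical helper). */
static inline uint64_t pmd_last_byte(uint64_t address)
{
	return (address & PMD_MASK) + PMD_SIZE - 1;
}
```

For the 3MiB example, pmd_first_byte(0x300000) is 0x200000 and pmd_last_byte(0x300000) is 0x3fffff, the last byte before the 4MiB boundary.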
    
    The second PMD range is in the mapping->page_tree, where a given file
    offset is covered by a radix tree entry that spans from one 2MiB aligned
    file offset to another 2MiB aligned file offset.
    
    So, for example, the file offset for 3MiB (pgoff 768) falls within the
    PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
    512) to 4MiB (pgoff 1024).
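
The same arithmetic in page offsets can be sketched as follows.  This assumes 4KiB pages and 2MiB PMDs; PMD_ORDER and the helper names are made up for the example, while PG_PMD_COLOUR mirrors the name used in the test quoted further down.

```c
#include <stdint.h>

#define PAGE_SHIFT	12				/* assumed 4KiB pages */
#define PMD_SHIFT	21				/* assumed 2MiB PMDs */
#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)	/* 9, illustrative name */
#define PG_PMD_COLOUR	((UINT64_C(1) << PMD_ORDER) - 1)	/* 511 */

/* Convert a byte offset in the file to a page offset. */
static inline uint64_t offset_to_pgoff(uint64_t offset)
{
	return offset >> PAGE_SHIFT;
}

/* First pgoff covered by the order 9 entry that contains 'pgoff'. */
static inline uint64_t pmd_entry_first_pgoff(uint64_t pgoff)
{
	return pgoff & ~PG_PMD_COLOUR;
}

/* Last pgoff covered by the order 9 entry that contains 'pgoff', inclusive. */
static inline uint64_t pmd_entry_last_pgoff(uint64_t pgoff)
{
	return pgoff | PG_PMD_COLOUR;
}
```

A file offset of 3MiB gives pgoff 768, whose entry covers pgoff 512 through 1023.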
    
    This system works so long as the addresses and file offsets for a given
    mapping both have the same offsets relative to the start of each PMD.
    
    Consider the case where the starting address for a given file isn't 2MiB
    aligned - say our faulting address is 3 MiB (0x30 0000), but that
    corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
    the mapping are misaligned so that the 2MiB range defined in the page
    tables never matches up with the 2MiB range defined in the radix tree.
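
To make the misalignment concrete, here is a small user-space model.  struct toy_mapping and its helpers are hypothetical stand-ins invented for this sketch, and the constants again assume 4KiB pages and 2MiB PMDs.  It compares where the page table PMD containing an address starts with where the radix tree's 2MiB range for the same address starts.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PMD_SIZE	(UINT64_C(1) << 21)
#define PMD_MASK	(~(PMD_SIZE - 1))
#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)

/* Hypothetical model of a mapping: where it starts in memory and in the file. */
struct toy_mapping {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_pgoff;	/* file page offset mapped at vm_start */
};

/* File page offset backing 'address'; address must be >= vm_start. */
static inline uint64_t toy_pgoff(const struct toy_mapping *m, uint64_t address)
{
	return m->vm_pgoff + ((address - m->vm_start) >> PAGE_SHIFT);
}

/* Virtual address at which the radix tree's 2MiB range containing 'address' begins. */
static inline uint64_t radix_range_start_addr(const struct toy_mapping *m,
					      uint64_t address)
{
	uint64_t pages_into_entry = toy_pgoff(m, address) & PG_PMD_COLOUR;

	return ((address >> PAGE_SHIFT) - pages_into_entry) << PAGE_SHIFT;
}

/* True when the page table range and the radix tree range start together. */
static inline bool ranges_line_up(const struct toy_mapping *m, uint64_t address)
{
	return (address & PMD_MASK) == radix_range_start_addr(m, address);
}
```

With vm_start at 3MiB and vm_pgoff 0, the radix tree range for address 3MiB starts at 3MiB while the page table PMD starts at 2MiB, so ranges_line_up() is false for every address in the mapping.  With vm_start at 2MiB it is true.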
    
    The current code notices this case for DAX faults to storage with the
    following test in dax_pmd_insert_mapping():
    
    	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
    		goto unlock_fallback;
    
    This test makes sure that the pfn we get from the driver is 2MiB
    aligned, and relies on the assumption that the 2MiB alignment of the pfn
    we get back from the driver matches the 2MiB alignment of the faulting
    address.
    
    However, faults to holes were not checked and we could hit the problem
    described above.
    
    This was reported in response to the NVML nvml/src/test/pmempool_sync
    TEST5:
    
    	$ cd nvml/src/test/pmempool_sync
    	$ make TEST5
    
    You can grab NVML here:
    
    	https://github.com/pmem/nvml/
    
    The dmesg warning you see when you hit this error is:
    
      WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
    
    Where we notice in dax_insert_mapping_entry() that the radix tree entry
    we are about to replace doesn't match the locked entry that we had
    previously inserted into the tree.  This happens because the initial
    insertion was done in grab_mapping_entry() using a pgoff calculated from
    the faulting address (vmf->address), and the replacement in
    dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
    vmf->pgoff.
    
    In our failure case those two page offsets (one calculated from
    vmf->address, one using vmf->pgoff) point to different order 9 radix
    tree entries.
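
The divergence can be sketched in user space as below.  This is only a model: struct toy_mapping and the helpers are hypothetical, and the sketch assumes that the offset "calculated from vmf->address" is the file offset of the PMD-aligned address, which this message does not spell out.

```c
#include <stdint.h>

#define PAGE_SHIFT	12			/* assumed 4KiB pages */
#define PMD_SIZE	(UINT64_C(1) << 21)	/* assumed 2MiB PMDs */
#define PMD_MASK	(~(PMD_SIZE - 1))
#define PMD_ORDER	9			/* illustrative name */

/* Hypothetical model of a mapping, not a kernel structure. */
struct toy_mapping {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_pgoff;	/* file page offset mapped at vm_start */
};

/* File page offset backing 'address'; address must be >= vm_start. */
static inline uint64_t toy_pgoff(const struct toy_mapping *m, uint64_t address)
{
	return m->vm_pgoff + ((address - m->vm_start) >> PAGE_SHIFT);
}

/* Order 9 entry index reached via the PMD-aligned address (assumed derivation). */
static inline uint64_t entry_from_address(const struct toy_mapping *m,
					  uint64_t address)
{
	return toy_pgoff(m, address & PMD_MASK) >> PMD_ORDER;
}

/* Order 9 entry index reached via the fault's own page offset. */
static inline uint64_t entry_from_pgoff(const struct toy_mapping *m,
					uint64_t address)
{
	return toy_pgoff(m, address) >> PMD_ORDER;
}
```

For a mapping that starts at 3MiB with pgoff 0, a fault at 5.5MiB (0x58 0000) reaches entry 0 through the aligned address (pgoff 256) but entry 1 through its own pgoff (640).  For a mapping that starts at 2MiB both lookups reach entry 1.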
    
    This failure case can result in a deadlock because the radix tree unlock
    also happens on the pgoff calculated from vmf->address.  This means that
    the locked radix tree entry that we swapped in to the tree in
    dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
    future faults to that 2MiB range will block forever.
    
    Fix this by validating that the faulting address's PMD offset matches
    the PMD offset from the start of the file.  This check is done at the
    very beginning of the fault and covers faults that would have mapped to
    storage as well as faults to holes.  I left the COLOUR check in
    dax_pmd_insert_mapping() in place in case we ever hit the insanity
    condition where the alignment of the pfn we get from the driver doesn't
    match the alignment of the userspace address.
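
A user-space sketch of the kind of check described here is below.  It is not the patch itself: pmd_colours_match() is a name made up for this example, and the constants assume 4KiB pages and 2MiB PMDs.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PMD_SIZE	(UINT64_C(1) << 21)
#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)

/*
 * True when the faulting address sits at the same page offset within its
 * 2MiB range as the file offset does within its own 2MiB range.  A PMD
 * fault would only proceed when this holds and would otherwise fall back.
 */
static inline bool pmd_colours_match(uint64_t address, uint64_t pgoff)
{
	return (pgoff & PG_PMD_COLOUR) ==
	       ((address >> PAGE_SHIFT) & PG_PMD_COLOUR);
}
```

The misaligned example from above, address 3MiB backed by pgoff 0, fails the check (colour 256 against colour 0), while address 3MiB backed by pgoff 768 passes.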
    
    Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
    
    
    Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>