Skip to content
  • Kiran Kumar Modukuri's avatar
    cachefiles: Fix refcounting bug in backing-file read monitoring · 934140ab
    Kiran Kumar Modukuri authored
    cachefiles_read_waiter() has the right to access a 'monitor' object by
    virtue of being called under the waitqueue lock for one of the pages in its
    purview.  However, it has no ref on that monitor object or on the
    associated operation.
    
    What it is allowed to do is to move the monitor object to the operation's
    to_do list, but once it drops the work_lock, it's actually no longer
    permitted to access that object.  However, it is trying to enqueue the
    retrieval operation for processing - but it can only do this via a pointer
    in the monitor object, something it shouldn't be doing.
    
    If it doesn't enqueue the operation, the operation may not get processed.
    If the order is flipped so that the enqueue is first, then it's possible
    for the work processor to look at the to_do list before the monitor is
    enqueued upon it.
    
    Fix this by getting a ref on the operation so that we can trust that it
    will still be there once we've added the monitor to the to_do list and
    dropped the work_lock.  The op can then be enqueued after the lock is
    dropped.
    
    The bug can manifest in one of a couple of ways.  The first manifestation
    looks like:
    
     FS-Cache:
     FS-Cache: Assertion failed
     FS-Cache: 6 == 5 is false
     ------------[ cut here ]------------
     kernel BUG at fs/fscache/operation.c:494!
     RIP: 0010:fscache_put_operation+0x1e3/0x1f0
     ...
     fscache_op_work_func+0x26/0x50
     process_one_work+0x131/0x290
     worker_thread+0x45/0x360
     kthread+0xf8/0x130
     ? create_worker+0x190/0x190
     ? kthread_cancel_work_sync+0x10/0x10
     ret_from_fork+0x1f/0x30
    
    This is due to the operation being in the DEAD state (6) rather than
    INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through
    fscache_put_operation().
    
    The bug can also manifest like the following:
    
     kernel BUG at fs/fscache/operation.c:69!
     ...
        [exception RIP: fscache_enqueue_operation+246]
     ...
     #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
     #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
     #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028
    
    I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not
    entirely clear which assertion failed.
    
    Fixes: 9ae326a6
    
     ("CacheFiles: A cache that backs onto a mounted filesystem")
    Reported-by: default avatarLei Xue <carmark.dlut@gmail.com>
    Reported-by: default avatarVegard Nossum <vegard.nossum@gmail.com>
    Reported-by: default avatarAnthony DeRobertis <aderobertis@metrics.net>
    Reported-by: default avatarNeilBrown <neilb@suse.com>
    Reported-by: default avatarDaniel Axtens <dja@axtens.net>
    Reported-by: default avatarKiran Kumar Modukuri <kiran.modukuri@gmail.com>
    Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    Reviewed-by: default avatarDaniel Axtens <dja@axtens.net>
    934140ab