1. 23 Feb, 2016 26 commits
  2. 19 Feb, 2016 4 commits
    • Jan Kara's avatar
      ext4: fix crashes in dioread_nolock mode · 74dae427
      Jan Kara authored
      Competing overwrite DIO in dioread_nolock mode will just overwrite
      pointer to io_end in the inode. This may result in data corruption or
      extent conversion happening from IO completion interrupt because we
      don't properly set buffer_defer_completion() when unlocked DIO races
      with locked DIO to unwritten extent.
      Since unlocked DIO doesn't need io_end for anything, just avoid
      allocating it and corrupting pointer from inode for locked DIO.
      A cleaner fix would be to avoid these games with io_end pointer from the
      inode but that requires more intrusive changes so we leave that for
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
    • Jan Kara's avatar
      ext4: fix bh->b_state corruption · ed8ad838
      Jan Kara authored
      ext4 can update bh->b_state non-atomically in _ext4_get_block() and
      ext4_da_get_block_prep(). Usually this is fine since bh is just a
      temporary storage for mapping information on stack but in some cases it
      can be fully living bh attached to a page. In such case non-atomic
      update of bh->b_state can race with an atomic update which then gets
      lost. Usually when we are mapping bh and thus updating bh->b_state
      non-atomically, nobody else touches the bh and so things work out fine
      but there is one case to especially worry about: ext4_finish_bio() uses
      BH_Uptodate_Lock on the first bh in the page to synchronize handling of
      PageWriteback state. So when blocksize < pagesize, we can be atomically
      modifying bh->b_state of a buffer that actually isn't under IO and thus
      can race e.g. with delalloc trying to map that buffer. The result is
      that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
      corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
      Fix the problem by always updating bh->b_state bits atomically.
      CC: stable@vger.kernel.org
      Reported-by: default avatarNikolay Borisov <kernel@kyup.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
    • Jeff Layton's avatar
      fsnotify: turn fsnotify reaper thread into a workqueue job · 0918f1c3
      Jeff Layton authored
      We don't require a dedicated thread for fsnotify cleanup.  Switch it
      over to a workqueue job instead that runs on the system_unbound_wq.
      In the interest of not thrashing the queued job too often when there are
      a lot of marks being removed, we delay the reaper job slightly when
      queueing it, to allow several to gather on the list.
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Tested-by: default avatarEryu Guan <guaneryu@gmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Jeff Layton's avatar
      Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread" · 13d34ac6
      Jeff Layton authored
      This reverts commit c510eff6
       ("fsnotify: destroy marks with
      call_srcu instead of dedicated thread").
      Eryu reported that he was seeing some OOM kills kick in when running a
      testcase that adds and removes inotify marks on a file in a tight loop.
      The above commit changed the code to use call_srcu to clean up the
      marks.  While that does (in principle) work, the srcu callback job is
      limited to cleaning up entries in small batches and only once per jiffy.
      It's easily possible to overwhelm that machinery with too many call_srcu
      callbacks, and Eryu's reproduer did just that.
      There's also another potential problem with using call_srcu here.  While
      you can obviously sleep while holding the srcu_read_lock, the callbacks
      run under local_bh_disable, so you can't sleep there.
      It's possible when putting the last reference to the fsnotify_mark that
      we'll end up putting a chain of references including the fsnotify_group,
      uid, and associated keys.  While I don't see any obvious ways that that
      could occurs, it's probably still best to avoid using call_srcu here
      after all.
      This patch reverts the above patch.  A later patch will take a different
      approach to eliminated the dedicated thread here.
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Reported-by: default avatarEryu Guan <guaneryu@gmail.com>
      Tested-by: default avatarEryu Guan <guaneryu@gmail.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  3. 16 Feb, 2016 4 commits
    • Tahsin Erdogan's avatar
      writeback: initialize inode members that track writeback history · 3d65ae46
      Tahsin Erdogan authored
      inode struct members that track cgroup writeback information
      should be reinitialized when inode gets allocated from
      kmem_cache. Otherwise, their values remain and get used by the
      new inode.
      Signed-off-by: default avatarTahsin Erdogan <tahsin@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: d10c8095
       ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      writeback: keep superblock pinned during cgroup writeback association switches · 5ff8eaac
      Tejun Heo authored
      If cgroup writeback is in use, an inode is associated with a cgroup
      for writeback.  If the inode's main dirtier changes to another cgroup,
      the association gets updated asynchronously.  Nothing was pinning the
      superblock while such switches are in progress and superblock could go
      away while async switching is pending or in progress leading to
      crashes like the following.
       kernel BUG at fs/jbd2/transaction.c:319!
       invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
       Hardware name: Google Google, BIOS Google 01/01/2011
       Workqueue: events inode_switch_wbs_work_fn
       task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
       RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
       RSP: 0018:ffff880209267c30  EFLAGS: 00010202
       Call Trace:
        [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
        [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
        [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
        [<ffffffff8035338b>] evict+0xbb/0x190
        [<ffffffff80354190>] iput+0x130/0x190
        [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
        [<ffffffff80279819>] process_one_work+0x129/0x300
        [<ffffffff80279b16>] worker_thread+0x126/0x480
        [<ffffffff8027ed14>] kthread+0xc4/0xe0
        [<ffffffff809771df>] ret_from_fork+0x3f/0x70
      Fix it by bumping s_active while cgroup association switching is in
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarTahsin Erdogan <tahsin@google.com>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: d10c8095
       ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Cc: stable@vger.kernel.org #v4.5+
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Kirill Tkhai's avatar
      ext4: fix memleak in ext4_readdir() · c906f38e
      Kirill Tkhai authored
      When ext4_bread() fails, fname_crypto_str remains
      allocated after return. Fix that.
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
    • Filipe Manana's avatar
      Btrfs: fix direct IO requests not reporting IO error to user space · 1636d1d7
      Filipe Manana authored
      If a bio for a direct IO request fails, we were not setting the error in
      the parent bio (the main DIO bio), making us not return the error to
      user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
      return the number of bytes issued for IO and not the error a bio created
      and submitted by btrfs_submit_direct() got from the block layer.
      This essentially happens because when we call:
         dio_end_io(dio_bio, bio->bi_error);
      It does not set dio_bio->bi_error to the value of the second argument.
      So just add this missing assignment in endio callbacks, just as we do in
      the error path at btrfs_submit_direct() when we fail to clone the dio bio
      or allocate its private object. This follows the convention of what is
      done with other similar APIs such as bio_endio() where the caller is
      responsible for setting the bi_error field in the bio it passes as an
      argument to bio_endio().
      This was detected by the new generic test cases in xfstests: 271, 272,
      276 and 278. Which essentially setup a dm error target, then load the
      error table, do a direct IO write and unload the error table. They
      expect the write to fail with -EIO, which was not getting reported
      when testing against btrfs.
      Cc: stable@vger.kernel.org  # 4.3+
      Fixes: 4246a0b6
       ("block: add a bi_error field to struct bio")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
  4. 12 Feb, 2016 6 commits
    • Eryu Guan's avatar
      ext4: remove unused parameter "newblock" in convert_initialized_extent() · 56263b4c
      Eryu Guan authored
      The "newblock" parameter is not used in convert_initialized_extent(),
      remove it.
      Signed-off-by: default avatarEryu Guan <guaneryu@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
    • Eryu Guan's avatar
      ext4: don't read blocks from disk after extents being swapped · bcff2488
      Eryu Guan authored
      I notice ext4/307 fails occasionally on ppc64 host, reporting md5
      checksum mismatch after moving data from original file to donor file.
      The reason is that move_extent_per_page() calls __block_write_begin()
      and block_commit_write() to write saved data from original inode blocks
      to donor inode blocks, but __block_write_begin() not only maps buffer
      heads but also reads block content from disk if the size is not block
      size aligned.  At this time the physical block number in mapped buffer
      head is pointing to the donor file not the original file, and that
      results in reading wrong data to page, which get written to disk in
      following block_commit_write call.
      This also can be reproduced by the following script on 1k block size ext4
      on x86_64 host:
          rm -f $donorfile $testfile
          # reserve space for donor file, written by 0xaa and sync to disk to
          # avoid EBUSY on EXT4_IOC_MOVE_EXT
          xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
          # create test file written by 0xbb
          xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
          # compute initial md5sum
          md5sum $testfile | tee md5sum.txt
          # drop cache, force e4compact to read data from disk
          echo 3 > /proc/sys/vm/drop_caches
          # test defrag
          echo "$testfile" | $e4compact -i -v -f $donorfile
          # check md5sum
          md5sum -c md5sum.txt
      Fix it by creating & mapping buffer heads only but not reading blocks
      from disk, because all the data in page is guaranteed to be up-to-date
      in mext_page_mkuptodate().
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEryu Guan <guaneryu@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
    • Insu Yun's avatar
      ext4: fix potential integer overflow · 46901760
      Insu Yun authored
      Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
      integer overflow could be happened.
      Therefore, need to fix integer overflow sanitization.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarInsu Yun <wuninsu@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
    • Huaitong Han's avatar
      ext4: add a line break for proc mb_groups display · 802cf1f9
      Huaitong Han authored
      This patch adds a line break for proc mb_groups display.
      Signed-off-by: default avatarHuaitong Han <huaitong.han@intel.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
    • Anton Protopopov's avatar
      ext4: ioctl: fix erroneous return value · fdde368e
      Anton Protopopov authored
      The ext4_ioctl_setflags() function which is used in the ioctls
      EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value
      EPERM instead of -EPERM in case of error. This bug was introduced by a
      recent commit 9b7365fc
      The following program can be used to illustrate the wrong behavior:
          #include <sys/types.h>
          #include <sys/ioctl.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <err.h>
          #define FS_IOC_GETFLAGS _IOR('f', 1, long)
          #define FS_IOC_SETFLAGS _IOW('f', 2, long)
          #define FS_IMMUTABLE_FL 0x00000010
          int main(void)
              int fd;
              long flags;
              fd = open("file", O_RDWR|O_CREAT, 0600);
              if (fd < 0)
                  err(1, "open");
              if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
                  err(1, "ioctl: FS_IOC_GETFLAGS");
              flags |= FS_IMMUTABLE_FL;
              if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
                  err(1, "ioctl: FS_IOC_SETFLAGS");
              warnx("ioctl returned no error");
              return 0;
      Running it gives the following result:
          $ strace -e ioctl ./test
          ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0
          ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1
          test: ioctl returned no error
          +++ exited with 0 +++
      Running the program on a kernel with the bug fixed gives the proper result:
          $ strace -e ioctl ./test
          ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0
          ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted)
          test: ioctl: FS_IOC_SETFLAGS: Operation not permitted
          +++ exited with 1 +++
      Signed-off-by: default avatarAnton Protopopov <a.s.protopopov@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
    • Jan Kara's avatar
      ext4: fix scheduling in atomic on group checksum failure · 05145bd7
      Jan Kara authored
      When block group checksum is wrong, we call ext4_error() while holding
      group spinlock from ext4_init_block_bitmap() or
      ext4_init_inode_bitmap() which results in scheduling while in atomic.
      Fix the issue by calling ext4_error() later after dropping the spinlock.
      CC: stable@vger.kernel.org
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>