1. 19 Aug, 2014 3 commits
  2. 15 Aug, 2014 9 commits
    • Chris Mason's avatar
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason authored
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • Filipe Manana's avatar
      Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
      Filipe Manana authored
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      For example, consider the following scenario, where we have:
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
        Leaf N:
          slot = 0                           slot = btrfs_header_nritems() - 1
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        Leaf N + 1:
            slot = 0                          slot = btrfs_header_nritems() - 1
        | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will returns us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
      And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
      into the "insert:" label, which will set tmp to:
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • Takashi Iwai's avatar
      Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 4eb1f66d
      Takashi Iwai authored
      We've got bug reports that btrfs crashes when quota is enabled on
      32bit kernel, typically with the Oops like below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      This indicates a NULL corruption in prefs_delayed list.  The further
      investigation and bisection pointed that the call of ulist_add_merge()
      results in the corruption.
      ulist_add_merge() takes u64 as aux and writes a 64bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer of a pointer to old_aux.  That is, the function overwrites
      64bit value on 32bit pointer.  This caused a NULL in the adjacent
      variable, in this case, prefs_delayed.
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to pass/store properly a pointer
      value instead of u64.  There are still ugly void ** cast remaining
      in the callers because void ** cannot be taken implicitly.  But, it's
      safer than explicit cast to u64, anyway.
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
      Cc: <stable@vger.kernel.org> [v3.11+]
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • Liu Bo's avatar
      Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo authored
      When failing to allocate space for the whole compressed extent, we'll
      fallback to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent, and these 'clean' pages will
      simply skip 'submit' part and go to endio directly, at last we got data
      corruption as we write nothing.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Tested-By: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • Mark Fasheh's avatar
      btrfs: correctly handle return from ulist_add · f90e579c
      Mark Fasheh authored
      ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
      doesn't take into account. As a result, that value can be bubbled up to
      callers, causing an error to be printed. Fix this by only returning the
      value of ulist_add() when it indicates an error.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • Mark Fasheh's avatar
      btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Mark Fasheh authored
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent and in the case that we find
      only one reference left, we go ahead and do the math on it's
      exclusive counts.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • Filipe Manana's avatar
      Btrfs: read lock extent buffer while walking backrefs · 6f7ff6d7
      Filipe Manana authored
      Before processing the extent buffer, acquire a read lock on it, so
      that we're safe against concurrent updates on the extent buffer.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • Josef Bacik's avatar
      Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik authored
      Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out to be not the
      case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
      argument and make __btrfs_mod_ref call it's process function with no_quota set
      always.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
    • David Sterba's avatar
      btrfs: adjust statfs calculations according to raid profiles · ba7b6e62
      David Sterba authored
      This has been discussed in thread:
      and this patch implements this proposal:
      Works fine for "clean" raid profiles where the raid factor correction
      does the right job. Otherwise it's pessimistic and may show low space
      although there's still some left.
      The df nubmers are lightly wrong in case of mixed block groups, but this
      is not a major usecase and can be addressed later.
      The RAID56 numbers are wrong almost the same way as before and will be
      addressed separately.
      CC: Hugo Mills <hugo@carfax.org.uk>
      CC: cwillu <cwillu@cwillu.com>
      CC: Josef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
  3. 14 Aug, 2014 3 commits
    • Jeff Layton's avatar
      locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock · 2dfb928f
      Jeff Layton authored
      There's no need to call locks_free_lock here while still holding the
      i_lock. Defer that until the lock has been dropped.
      Acked-by: default avatarJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: default avatarJeff Layton <jlayton@primarydata.com>
    • Jeff Layton's avatar
      locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped · ed9814d8
      Jeff Layton authored
      In commit 72f98e72
       (locks: turn lock_flocks into a spinlock), we
      moved from using the BKL to a global spinlock. With this change, we lost
      the ability to block in the fl_release_private operation.
      This is problematic for NFS (and probably some other filesystems as
      well). Add a new list_head argument to locks_delete_lock. If that
      argument is non-NULL, then queue any locks that we want to free to the
      list instead of freeing them.
      Then, add a new locks_dispose_list function that will walk such a list
      and call locks_free_lock on them after the i_lock has been dropped.
      Finally, change all of the callers of locks_delete_lock to pass in a
      list_head, except for lease_modify. That function can be called long
      after the i_lock has been acquired. Deferring the freeing of a lease
      after unlocking it in that function is non-trivial until we overhaul
      some of the spinlocking in the lease code.
      Currently though, no filesystem that sets fl_release_private supports
      leases, so this is not currently a problem. We'll eventually want to
      make the same change in the lease code, but it needs a lot more work
      before we can reasonably do so.
      Acked-by: default avatarJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: default avatarJeff Layton <jlayton@primarydata.com>
    • Jeff Layton's avatar
      locks: don't reuse file_lock in __posix_lock_file · b84d49f9
      Jeff Layton authored
      Currently in the case where a new file lock completely replaces the old
      one, we end up overwriting the existing lock with the new info. This
      means that we have to call fl_release_private inside i_lock. Change the
      code to instead copy the info to new_fl, insert that lock into the
      correct spot and then delete the old lock. In a later patch, we'll defer
      the freeing of the old lock until after the i_lock has been dropped.
      Acked-by: default avatarJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: default avatarJeff Layton <jlayton@primarydata.com>
  4. 12 Aug, 2014 1 commit
    • Jan Kara's avatar
      reiserfs: Fix use after free in journal teardown · 01777836
      Jan Kara authored
      If do_journal_release() races with do_journal_end() which requeues
      delayed works for transaction flushing, we can leave work items for
      flushing outstanding transactions queued while freeing them. That
      results in use after free and possible crash in run_timers_softirq().
      Fix the problem by not requeueing works if superblock is being shut down
      (MS_ACTIVE not set) and using cancel_delayed_work_sync() in
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
  5. 11 Aug, 2014 4 commits
    • Jeff Layton's avatar
      locks: don't call locks_release_private from locks_copy_lock · 566709bd
      Jeff Layton authored
      All callers of locks_copy_lock pass in a brand new file_lock struct, so
      there's no need to call locks_release_private on it. Replace that with
      a warning that fires in the event that we receive a target lock that
      doesn't look like it's properly initialized.
      Acked-by: default avatarJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: default avatarJeff Layton <jlayton@primarydata.com>
    • Jeff Layton's avatar
      locks: show delegations as "DELEG" in /proc/locks · 8144f1f6
      Jeff Layton authored
      Now that they are a distinct lease type, show them as such.
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: default avatarJeff Layton <jlayton@primarydata.com>
    • Al Viro's avatar
      fix copy_tree() regression · 12a5b529
      Al Viro authored
      Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
      vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
      on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
      by C), copy_tree() of A would make a copy of D' shadow the the copy of
      C, not the other way around.
      It's easy to fix, fortunately - just make sure that mount follows
      the one that shadows it in mnt_child as well as in mnt_hash, and when
      copy_tree() decides to attach a new mount, check if the last child
      it has added to the same parent should be shadowing the new one.
      And if it should, just use the same logics commit_tree() has - put the
      new mount into the hash and children lists right after the one that
      should shadow it.
      Cc: stable@vger.kernel.org [3.14 and later]
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Linus Torvalds's avatar
      Revert "proc: Point /proc/{mounts,net} at /proc/thread-self/{mounts,net}... · 155134fe
      Linus Torvalds authored
      Revert "proc: Point /proc/{mounts,net} at /proc/thread-self/{mounts,net} instead of /proc/self/{mounts,net}"
      This reverts commits 344470ca and e8132440
      It turns out that the exact path in the symlink matters, if for somewhat
      unfortunate reasons: some apparmor configurations don't allow dhclient
      access to the per-thread /proc files.  As reported by Jörg Otte:
        audit: type=1400 audit(1407684227.003:28): apparmor="DENIED"
          operation="open" profile="/sbin/dhclient"
          name="/proc/1540/task/1540/net/dev" pid=1540 comm="dhclient"
          requested_mask="r" denied_mask="r" fsuid=0 ouid=0
      so we had better revert this for now.  We might be able to work around
      this in practice by only using the per-thread symlinks if the thread
      isn't the thread group leader, and if the namespaces differ between
      threads (which basically never happens).
      We'll see. In the meantime, the revert was made to be intentionally easy.
      Reported-by: default avatarJörg Otte <jrg.otte@gmail.com>
      Acked-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 08 Aug, 2014 20 commits