1. 22 Jun, 2018 1 commit
    • Filipe Manana's avatar
      Btrfs: fix return value on rename exchange failure · c5b4a50b
      Filipe Manana authored
      If we failed during a rename exchange operation after starting/joining a
      transaction, we would end up replacing the return value, stored in the
      local 'ret' variable, with the return value from btrfs_end_transaction().
      So this could end up returning 0 (success) to user space despite the
      operation having failed and aborted the transaction, because if there are
      multiple tasks having a reference on the transaction at the time
      btrfs_end_transaction() is called by the rename exchange, that function
      returns 0 (otherwise it returns -EIO and not the original error value).
      So fix this by not overwriting the return value on error after getting
      a transaction handle.
      
      Fixes: cdd1fedf
      
       ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c5b4a50b
  2. 07 Jun, 2018 1 commit
  3. 05 Jun, 2018 1 commit
    • Deepa Dinamani's avatar
      vfs: change inode times to use struct timespec64 · 95582b00
      Deepa Dinamani authored
      
      
      struct timespec is not y2038 safe. Transition vfs to use
      y2038 safe struct timespec64 instead.
      
      The change was made with the help of the following cocinelle
      script. This catches about 80% of the changes.
      All the header file and logic changes are included in the
      first 5 rules. The rest are trivial substitutions.
      I avoid changing any of the function signatures or any other
      filesystem specific data structures to keep the patch simple
      for review.
      
      The script can be a little shorter by combining different cases.
      But, this version was sufficient for my usecase.
      
      virtual patch
      
      @ depends on patch @
      identifier now;
      @@
      - struct timespec
      + struct timespec64
        current_time ( ... )
        {
      - struct timespec now = current_kernel_time();
      + struct timespec64 now = current_kernel_time64();
        ...
      - return timespec_trunc(
      + return timespec64_trunc(
        ... );
        }
      
      @ depends on patch @
      identifier xtime;
      @@
       struct \( iattr \| inode \| kstat \) {
       ...
      -       struct timespec xtime;
      +       struct timespec64 xtime;
       ...
       }
      
      @ depends on patch @
      identifier t;
      @@
       struct inode_operations {
       ...
      int (*update_time) (...,
      -       struct timespec t,
      +       struct timespec64 t,
      ...);
       ...
       }
      
      @ depends on patch @
      identifier t;
      identifier fn_update_time =~ "update_time$";
      @@
       fn_update_time (...,
      - struct timespec *t,
      + struct timespec64 *t,
       ...) { ... }
      
      @ depends on patch @
      identifier t;
      @@
      lease_get_mtime( ... ,
      - struct timespec *t
      + struct timespec64 *t
        ) { ... }
      
      @te depends on patch forall@
      identifier ts;
      local idexpression struct inode *inode_node;
      identifier i_xtime =~ "^i_[acm]time$";
      identifier ia_xtime =~ "^ia_[acm]time$";
      identifier fn_update_time =~ "update_time$";
      identifier fn;
      expression e, E3;
      local idexpression struct inode *node1;
      local idexpression struct inode *node2;
      local idexpression struct iattr *attr1;
      local idexpression struct iattr *attr2;
      local idexpression struct iattr attr;
      identifier i_xtime1 =~ "^i_[acm]time$";
      identifier i_xtime2 =~ "^i_[acm]time$";
      identifier ia_xtime1 =~ "^ia_[acm]time$";
      identifier ia_xtime2 =~ "^ia_[acm]time$";
      @@
      (
      (
      - struct timespec ts;
      + struct timespec64 ts;
      |
      - struct timespec ts = current_time(inode_node);
      + struct timespec64 ts = current_time(inode_node);
      )
      
      <+... when != ts
      (
      - timespec_equal(&inode_node->i_xtime, &ts)
      + timespec64_equal(&inode_node->i_xtime, &ts)
      |
      - timespec_equal(&ts, &inode_node->i_xtime)
      + timespec64_equal(&ts, &inode_node->i_xtime)
      |
      - timespec_compare(&inode_node->i_xtime, &ts)
      + timespec64_compare(&inode_node->i_xtime, &ts)
      |
      - timespec_compare(&ts, &inode_node->i_xtime)
      + timespec64_compare(&ts, &inode_node->i_xtime)
      |
      ts = current_time(e)
      |
      fn_update_time(..., &ts,...)
      |
      inode_node->i_xtime = ts
      |
      node1->i_xtime = ts
      |
      ts = inode_node->i_xtime
      |
      <+... attr1->ia_xtime ...+> = ts
      |
      ts = attr1->ia_xtime
      |
      ts.tv_sec
      |
      ts.tv_nsec
      |
      btrfs_set_stack_timespec_sec(..., ts.tv_sec)
      |
      btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
      |
      - ts = timespec64_to_timespec(
      + ts =
      ...
      -)
      |
      - ts = ktime_to_timespec(
      + ts = ktime_to_timespec64(
      ...)
      |
      - ts = E3
      + ts = timespec_to_timespec64(E3)
      |
      - ktime_get_real_ts(&ts)
      + ktime_get_real_ts64(&ts)
      |
      fn(...,
      - ts
      + timespec64_to_timespec(ts)
      ,...)
      )
      ...+>
      (
      <... when != ts
      - return ts;
      + return timespec64_to_timespec(ts);
      ...>
      )
      |
      - timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
      + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
      |
      - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
      + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
      |
      - timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
      + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
      |
      node1->i_xtime1 =
      - timespec_trunc(attr1->ia_xtime1,
      + timespec64_trunc(attr1->ia_xtime1,
      ...)
      |
      - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
      + attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
      ...)
      |
      - ktime_get_real_ts(&attr1->ia_xtime1)
      + ktime_get_real_ts64(&attr1->ia_xtime1)
      |
      - ktime_get_real_ts(&attr.ia_xtime1)
      + ktime_get_real_ts64(&attr.ia_xtime1)
      )
      
      @ depends on patch @
      struct inode *node;
      struct iattr *attr;
      identifier fn;
      identifier i_xtime =~ "^i_[acm]time$";
      identifier ia_xtime =~ "^ia_[acm]time$";
      expression e;
      @@
      (
      - fn(node->i_xtime);
      + fn(timespec64_to_timespec(node->i_xtime));
      |
       fn(...,
      - node->i_xtime);
      + timespec64_to_timespec(node->i_xtime));
      |
      - e = fn(attr->ia_xtime);
      + e = fn(timespec64_to_timespec(attr->ia_xtime));
      )
      
      @ depends on patch forall @
      struct inode *node;
      struct iattr *attr;
      identifier i_xtime =~ "^i_[acm]time$";
      identifier ia_xtime =~ "^ia_[acm]time$";
      identifier fn;
      @@
      {
      + struct timespec ts;
      <+...
      (
      + ts = timespec64_to_timespec(node->i_xtime);
      fn (...,
      - &node->i_xtime,
      + &ts,
      ...);
      |
      + ts = timespec64_to_timespec(attr->ia_xtime);
      fn (...,
      - &attr->ia_xtime,
      + &ts,
      ...);
      )
      ...+>
      }
      
      @ depends on patch forall @
      struct inode *node;
      struct iattr *attr;
      struct kstat *stat;
      identifier ia_xtime =~ "^ia_[acm]time$";
      identifier i_xtime =~ "^i_[acm]time$";
      identifier xtime =~ "^[acm]time$";
      identifier fn, ret;
      @@
      {
      + struct timespec ts;
      <+...
      (
      + ts = timespec64_to_timespec(node->i_xtime);
      ret = fn (...,
      - &node->i_xtime,
      + &ts,
      ...);
      |
      + ts = timespec64_to_timespec(node->i_xtime);
      ret = fn (...,
      - &node->i_xtime);
      + &ts);
      |
      + ts = timespec64_to_timespec(attr->ia_xtime);
      ret = fn (...,
      - &attr->ia_xtime,
      + &ts,
      ...);
      |
      + ts = timespec64_to_timespec(attr->ia_xtime);
      ret = fn (...,
      - &attr->ia_xtime);
      + &ts);
      |
      + ts = timespec64_to_timespec(stat->xtime);
      ret = fn (...,
      - &stat->xtime);
      + &ts);
      )
      ...+>
      }
      
      @ depends on patch @
      struct inode *node;
      struct inode *node2;
      identifier i_xtime1 =~ "^i_[acm]time$";
      identifier i_xtime2 =~ "^i_[acm]time$";
      identifier i_xtime3 =~ "^i_[acm]time$";
      struct iattr *attrp;
      struct iattr *attrp2;
      struct iattr attr ;
      identifier ia_xtime1 =~ "^ia_[acm]time$";
      identifier ia_xtime2 =~ "^ia_[acm]time$";
      struct kstat *stat;
      struct kstat stat1;
      struct timespec64 ts;
      identifier xtime =~ "^[acmb]time$";
      expression e;
      @@
      (
      ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
      |
       node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
      |
       node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
      |
       node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
      |
       stat->xtime = node2->i_xtime1;
      |
       stat1.xtime = node2->i_xtime1;
      |
      ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
      |
      ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
      |
      - e = node->i_xtime1;
      + e = timespec64_to_timespec( node->i_xtime1 );
      |
      - e = attrp->ia_xtime1;
      + e = timespec64_to_timespec( attrp->ia_xtime1 );
      |
      node->i_xtime1 = current_time(...);
      |
       node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
      - e;
      + timespec_to_timespec64(e);
      |
       node->i_xtime1 = node->i_xtime3 =
      - e;
      + timespec_to_timespec64(e);
      |
      - node->i_xtime1 = e;
      + node->i_xtime1 = timespec_to_timespec64(e);
      )
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: <anton@tuxera.com>
      Cc: <balbi@kernel.org>
      Cc: <bfields@fieldses.org>
      Cc: <darrick.wong@oracle.com>
      Cc: <dhowells@redhat.com>
      Cc: <dsterba@suse.com>
      Cc: <dwmw2@infradead.org>
      Cc: <hch@lst.de>
      Cc: <hirofumi@mail.parknet.co.jp>
      Cc: <hubcap@omnibond.com>
      Cc: <jack@suse.com>
      Cc: <jaegeuk@kernel.org>
      Cc: <jaharkes@cs.cmu.edu>
      Cc: <jslaby@suse.com>
      Cc: <keescook@chromium.org>
      Cc: <mark@fasheh.com>
      Cc: <miklos@szeredi.hu>
      Cc: <nico@linaro.org>
      Cc: <reiserfs-devel@vger.kernel.org>
      Cc: <richard@nod.at>
      Cc: <sage@redhat.com>
      Cc: <sfrench@samba.org>
      Cc: <swhiteho@redhat.com>
      Cc: <tj@kernel.org>
      Cc: <trond.myklebust@primarydata.com>
      Cc: <tytso@mit.edu>
      Cc: <viro@zeniv.linux.org.uk>
      95582b00
  4. 30 May, 2018 8 commits
    • Omar Sandoval's avatar
      Btrfs: clean up error handling in btrfs_truncate() · ad7e1a74
      Omar Sandoval authored
      
      
      btrfs_truncate() uses two variables for error handling, ret and err (if
      this sounds familiar, it's because btrfs_truncate_inode_items() did
      something similar). This is error prone, as was made evident by "Btrfs:
      fix error handling in btrfs_truncate()". We only have err because we
      don't want to mask an error if we call btrfs_update_inode() and
      btrfs_end_transaction(), so let's make that its own scoped return
      variable and use ret everywhere else.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad7e1a74
    • Nikolay Borisov's avatar
      btrfs: Factor out write portion of btrfs_get_blocks_direct · c5794e51
      Nikolay Borisov authored
      
      
      Now that the read side is extracted into its own function, do the same
      to the write side. This leaves btrfs_get_blocks_direct_write with the
      sole purpose of handling common locking required. Also flip the
      condition in btrfs_get_blocks_direct_write so that the write case
      comes first and we check for if (Create) rather than if (!create). This
      is purely subjective but I believe makes reading a bit more "linear".
      No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c5794e51
    • Nikolay Borisov's avatar
      btrfs: Factor out read portion of btrfs_get_blocks_direct · 1c8d0175
      Nikolay Borisov authored
      
      
      Currently this function handles both the READ and WRITE dio cases. This
      is facilitated by a bunch of 'if' statements, a goto short-circuit
      statement and a very perverse aliasing of "!created"(READ) case
      by setting lockstart = lockend and checking for lockstart < lockend for
      detecting the write. Let's simplify this mess by extracting the
      READ-only code into a separate __btrfs_get_block_direct_read function.
      This is only the first step, the next one will be to factor out the
      write side as well. The end goal will be to have the common locking/
      unlocking code in btrfs_get_blocks_direct and then it will call either
      the read|write subvariants. No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c8d0175
    • Su Yue's avatar
      btrfs: return error value if create_io_em failed in cow_file_range · 090a127a
      Su Yue authored
      In cow_file_range(), create_io_em() may fail, but its return value is
      not recorded.  Then return value may be 0 even it failed which is a
      wrong behavior.
      
      Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.
      
      Fixes: 6f9994db
      
       ("Btrfs: create a helper to create em for IO")
      CC: stable@vger.kernel.org # 4.11+
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      090a127a
    • Gu JinXiang's avatar
      btrfs: drop unused parameter qgroup_reserved · c4c129db
      Gu JinXiang authored
      Since commit 7775c818
      
       ("btrfs: remove unused parameter from
      btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
      by caller of function btrfs_subvolume_reserve_metadata.  So remove it.
      Signed-off-by: default avatarGu JinXiang <gujx@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4c129db
    • Ethan Lien's avatar
      btrfs: balance dirty metadata pages in btrfs_finish_ordered_io · e73e81b6
      Ethan Lien authored
      
      
      [Problem description and how we fix it]
      We should balance dirty metadata pages at the end of
      btrfs_finish_ordered_io, since a small, unmergeable random write can
      potentially produce dirty metadata which is multiple times larger than
      the data itself. For example, a small, unmergeable 4KiB write may
      produce:
      
          16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
      
      Although we do call balance dirty pages in write side, but in the
      buffered write path, most metadata are dirtied only after we reach the
      dirty background limit (which by far only counts dirty data pages) and
      wakeup the flusher thread. If there are many small, unmergeable random
      writes spread in a large btree, we'll find a burst of dirty pages
      exceeds the dirty_bytes limit after we wakeup the flusher thread - which
      is not what we expect. In our machine, it caused out-of-memory problem
      since a page cannot be dropped if it is marked dirty.
      
      Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
      but since we do btrfs_finish_ordered_io in a separate worker, it will not
      stop the flusher consuming dirty pages. Also, we use different worker for
      metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
      the size of dirty metadata pages.
      
      [Reproduce steps]
      To reproduce the problem, we need to do 4KiB write randomly spread in a
      large btree. In our 2GiB RAM machine:
      
      1) Create 4 subvolumes.
      2) Run fio on each subvolume:
      
         [global]
         direct=0
         rw=randwrite
         ioengine=libaio
         bs=4k
         iodepth=16
         numjobs=1
         group_reporting
         size=128G
         runtime=1800
         norandommap
         time_based
         randrepeat=0
      
      3) Take snapshot on each subvolume and repeat fio on existing files.
      4) Repeat step (3) until we get large btrees.
         In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
         metadata in each subvolume tree and 12GiB of metadata in extent tree.
      5) Stop all fio, take snapshot again, and wait until all delayed work is
         completed.
      6) Start all fio. Few seconds later we hit OOM when the flusher starts
         to work.
      
      It can be reproduced even when using nocow write.
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e73e81b6
    • Ethan Lien's avatar
      btrfs: lift some btrfs_cross_ref_exist checks in nocow path · 78d4295b
      Ethan Lien authored
      
      
      In nocow path, we check if the extent is snapshotted in
      btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
      unnecessary search into extent tree.
      
      A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:
      
      [global]
      group_reporting
      time_based
      thread=1
      ioengine=libaio
      bs=4k
      iodepth=32
      size=64G
      runtime=180
      numjobs=8
      rw=randwrite
      
      [file1]
      filename=/mnt/nocow/testfile
      
      IOPS result:   unpatched     patched
      
      1 fio round:     46670        46958
      snapshot
      2 fio round:     51826        54498
      3 fio round:     59767        61289
      
      After snapshot, the first fio get about 5% performance gain. As we
      continually write to the same file, all writes will resume to nocow mode
      and eventually we have no performance gain.
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      78d4295b
    • Lu Fengqi's avatar
      btrfs: Remove fs_info argument from btrfs_uuid_tree_rem · d1957791
      Lu Fengqi authored
      
      
      This function always takes a transaction handle which contains a
      reference to the fs_info. Use that and remove the extra argument.
      Signed-off-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      [ rename the function ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1957791
  5. 28 May, 2018 29 commits