1. 15 Aug, 2013 1 commit
    • ceph: introduce i_truncate_mutex · b0d7c223
      Yan, Zheng authored
      I encountered the deadlock below when running fsstress:
      wmtruncate work      truncate                 MDS
      ---------------  ------------------  --------------------------
                         lock i_mutex
                                            <- truncate file
      lock i_mutex (blocked)
                                            <- revoking Fcb (filelock to MIX)
                         send request ->
                                               handle request (xlock filelock)
      At the initial time, there are some dirty pages in the page cache.
      When the kclient receives the truncate message, it reduces inode size
      and creates some 'out of i_size' dirty pages. wmtruncate work can't
      truncate these dirty pages because it's blocked by the i_mutex. Later
      when the kclient receives the cap message that revokes Fcb caps, it
      can't flush all dirty pages because writepages() only flushes dirty
      pages within the inode size.
      When the MDS handles the 'truncate' request from kclient, it waits
      for the filelock to become stable. But the filelock is stuck in
      unstable state because it can't finish revoking kclient's Fcb caps.
      The truncate pagecache locking has already caused lots of trouble
      for us. I think it's time to simplify it by introducing a new mutex.
      We use the new mutex to prevent concurrent truncate_inode_pages().
      There is no need to worry about race between buffered write and
      truncate_inode_pages(), because our "get caps" mechanism prevents
      them from concurrent execution.
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  2. 10 Aug, 2013 1 commit
    • ceph: fix freeing inode vs removing session caps race · 6f60f889
      Yan, Zheng authored
      remove_session_caps() uses iterate_session_caps() to remove caps,
      but iterate_session_caps() skips inodes that are being deleted.
      So session->s_nr_caps can be non-zero after iterate_session_caps()
      returns. We can fix the issue by waiting until deletions are complete.
      __wait_on_freeing_inode() is designed for the job, but it is not
      exported, so we use the inode lookup function to reach it.
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
  3. 03 Jul, 2013 2 commits
  4. 17 May, 2013 1 commit
    • ceph: ceph_pagelist_append might sleep while atomic · 39be95e9
      Jim Schutt authored
      Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc()
      while holding a lock, but it's spoiled because ceph_pagelist_addpage()
      always calls kmap(), which might sleep.  Here's the result:
      [13439.295457] ceph: mds0 reconnect start
      [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
      [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
          . . .
      [13439.376225] Call Trace:
      [13439.378757]  [<ffffffff81076f4c>] __might_sleep+0xfc/0x110
      [13439.384353]  [<ffffffffa03f4ce0>] ceph_pagelist_append+0x120/0x1b0 [libceph]
      [13439.391491]  [<ffffffffa0448fe9>] ceph_encode_locks+0x89/0x190 [ceph]
      [13439.398035]  [<ffffffff814ee849>] ? _raw_spin_lock+0x49/0x50
      [13439.403775]  [<ffffffff811cadf5>] ? lock_flocks+0x15/0x20
      [13439.409277]  [<ffffffffa045e2af>] encode_caps_cb+0x41f/0x4a0 [ceph]
      [13439.415622]  [<ffffffff81196748>] ? igrab+0x28/0x70
      [13439.420610]  [<ffffffffa045e9f8>] ? iterate_session_caps+0xe8/0x250 [ceph]
      [13439.427584]  [<ffffffffa045ea25>] iterate_session_caps+0x115/0x250 [ceph]
      [13439.434499]  [<ffffffffa045de90>] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
      [13439.441646]  [<ffffffffa0462888>] send_mds_reconnect+0x238/0x450 [ceph]
      [13439.448363]  [<ffffffffa0464542>] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
      [13439.455250]  [<ffffffffa0462e42>] check_new_map+0x352/0x500 [ceph]
      [13439.461534]  [<ffffffffa04631ad>] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
      [13439.468432]  [<ffffffff814ebc7e>] ? mutex_unlock+0xe/0x10
      [13439.473934]  [<ffffffffa043c612>] extra_mon_dispatch+0x22/0x30 [ceph]
      [13439.480464]  [<ffffffffa03f6c2c>] dispatch+0xbc/0x110 [libceph]
      [13439.486492]  [<ffffffffa03eec3d>] process_message+0x1ad/0x1d0 [libceph]
      [13439.493190]  [<ffffffffa03f1498>] ? read_partial_message+0x3e8/0x520 [libceph]
          . . .
      [13439.587132] ceph: mds0 reconnect success
      [13490.720032] ceph: mds0 caps stale
      [13501.235257] ceph: mds0 recovery completed
      [13501.300419] ceph: mds0 caps renewed
      Fix it up by encoding locks into a buffer first, and when the number
      of encoded locks is stable, copy that into a ceph_pagelist.
      [elder@inktank.com: abbreviated the stack info a bit.]
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: Jim Schutt <jaschut@sandia.gov>
      Reviewed-by: Alex Elder <elder@inktank.com>
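      The "encode into a buffer first, retry until the count is stable"
      approach described in this commit can be sketched in userspace C as
      follows. This is only an illustration under assumptions: struct
      lock_rec, struct lock_table and snapshot_locks() are hypothetical
      names, a pthread mutex stands in for the lock that must not be held
      across a sleeping call, and malloc() stands in for the allocation
      that might sleep. It is not the code from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct lock_rec { int start, len; };

struct lock_table {
    pthread_mutex_t mu;          /* models the lock we must not sleep under */
    struct lock_rec recs[16];
    int count;                   /* may change whenever mu is not held */
};

/*
 * Copy the current records into a freshly allocated buffer.
 * The allocation (which may block) only happens while the lock is dropped.
 * If the table grew while we were allocating, throw the buffer away and retry.
 * Returns the buffer (caller frees) and stores the record count in *out_n,
 * or returns NULL if allocation fails.
 */
struct lock_rec *snapshot_locks(struct lock_table *t, int *out_n)
{
    for (;;) {
        /* Pass 1: only read the count under the lock. */
        pthread_mutex_lock(&t->mu);
        int want = t->count;
        pthread_mutex_unlock(&t->mu);

        /* Allocate outside the lock; at least one element to avoid malloc(0). */
        struct lock_rec *buf = malloc((want ? want : 1) * sizeof(*buf));
        if (!buf)
            return NULL;

        /* Pass 2: copy under the lock if the buffer is still large enough. */
        pthread_mutex_lock(&t->mu);
        if (t->count <= want) {
            int n = t->count;
            memcpy(buf, t->recs, n * sizeof(*buf));
            pthread_mutex_unlock(&t->mu);
            *out_n = n;
            return buf;
        }
        pthread_mutex_unlock(&t->mu);
        free(buf);               /* count changed underneath us: retry */
    }
}
```

      Once the snapshot exists, it can be appended to whatever output
      structure is needed without any lock held, which is the property
      the fix relies on.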
  5. 02 May, 2013 4 commits
  6. 22 Feb, 2013 1 commit
  7. 20 Feb, 2013 1 commit
  8. 12 Feb, 2013 1 commit
    • ceph: Translate between uid and gids in cap messages and kuids and kgids · 05cb11c1
      Eric W. Biederman authored
      - Make the uid and gid arguments of send_cap_msg() used to compose
        ceph_mds_caps messages of type kuid_t and kgid_t.
      - Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg()
        through variables of type kuid_t and kgid_t.
      - Modify struct ceph_cap_snap to store uids and gids in types kuid_t
        and kgid_t.  This allows capturing inode->i_uid and inode->i_gid in
        ceph_queue_cap_snap() without loss and passing them to
        __ceph_flush_snaps() where they are removed from struct
        ceph_cap_snap and passed to send_cap_msg().
      - In handle_cap_grant translate uid and gids in the initial user
        namespace stored in struct ceph_mds_cap into kuids and kgids
        before setting inode->i_uid and inode->i_gid.
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
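      The translation at the message boundary can be illustrated with a
      small userspace sketch. Everything here is a simplified assumption:
      the kuid_t wrapper, wire_to_kuid(), kuid_to_wire(), struct cap_msg
      and struct inode_ids are hypothetical stand-ins. The kernel's own
      helpers, make_kuid() and from_kuid(), take a user namespace
      argument; this sketch models only the initial namespace, where the
      mapping is the identity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wrapping the id in a struct means a raw wire uid cannot be assigned to
 * an in-kernel uid by accident; the compiler forces an explicit conversion.
 */
typedef struct { uint32_t val; } kuid_t;

/* Identity mapping: models the initial user namespace only (assumption). */
static inline kuid_t wire_to_kuid(uint32_t wire_uid)
{
    return (kuid_t){ wire_uid };
}

static inline uint32_t kuid_to_wire(kuid_t k)
{
    return k.val;
}

static inline int kuid_eq(kuid_t a, kuid_t b)
{
    return a.val == b.val;
}

/* Hypothetical, heavily reduced versions of the message and inode. */
struct cap_msg   { uint32_t uid; };   /* raw id as carried on the wire */
struct inode_ids { kuid_t i_uid; };   /* typed id as stored in the inode */

/* Incoming grant: convert the wire id before storing it in the inode. */
void handle_grant(struct inode_ids *in, const struct cap_msg *m)
{
    in->i_uid = wire_to_kuid(m->uid);
}

/* Outgoing cap message: convert back to a raw id only when encoding. */
void fill_cap_msg(struct cap_msg *m, const struct inode_ids *in)
{
    m->uid = kuid_to_wire(in->i_uid);
}
```

      The same pattern applies to gids. Keeping the typed value all the
      way through intermediate structures, and converting only at the
      encode and decode points, is what lets the ids pass through
      "without loss".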
  9. 02 Aug, 2012 1 commit
    • ceph: simplify+fix atomic_open · 5ef50c3b
      Sage Weil authored
      The initial ->atomic_open op was carried over from the old intent code,
      which was incomplete and didn't really work.  Replace it with a fresh
      method.  In particular:
       * always attempt to do an atomic open+lookup, both for the create case
         and for lookups of existing files.
       * fix symlink handling by returning 1 to the VFS so that we can follow
         the link to its destination. This fixes a longstanding ceph bug (#2392).
      Signed-off-by: Sage Weil <sage@inktank.com>
  10. 31 Jul, 2012 1 commit
    • ceph: define snap counts as u32 everywhere · aa711ee3
      Alex Elder authored
      There are two structures in which a count of snapshots is
      maintained:
          struct ceph_snap_context {
              ...
              u32 num_snaps;
              ...
          }
      and
          struct ceph_snap_realm {
              ...
              u32 num_prior_parent_snaps;   /*  had prior to parent_since */
              ...
              u32 num_snaps;
              ...
          }
      These fields never take on negative values (e.g., to hold special
      meaning), and so are really inherently unsigned.  Furthermore they
      take their value from over-the-wire or on-disk formatted 32-bit
      values.
      So change their definition to have type u32, and change some spots
      elsewhere in the code to account for this change.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
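      Why an unsigned type suits a count decoded from a 32-bit wire field
      can be shown with a small userspace sketch. The function names and
      the buffer layout (a little-endian 32-bit count followed by 64-bit
      ids) are assumptions made for illustration and are not taken from
      the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a little-endian 32-bit value from a byte buffer. */
static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] |
           (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 |
           (uint32_t)p[3] << 24;
}

/*
 * Read a snapshot count from the start of buf and check that the rest of
 * the buffer can hold that many 64-bit ids.
 * Returns 0 and stores the count on success, -1 if the buffer is too short.
 *
 * Because the count is unsigned, the wire value 0xffffffff is a very large
 * count that fails the bounds check. Stored in a signed int it would read
 * as -1 and could slip past a naive "count <= remaining" comparison.
 */
int decode_snap_count(const unsigned char *buf, size_t len, uint32_t *num_snaps)
{
    if (len < 4)
        return -1;

    uint32_t n = get_le32(buf);

    /* Divide rather than multiply so the check itself cannot overflow. */
    if ((len - 4) / sizeof(uint64_t) < n)
        return -1;

    *num_snaps = n;
    return 0;
}
```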
  11. 14 Jul, 2012 5 commits
  12. 22 Mar, 2012 2 commits
  13. 12 Jan, 2012 1 commit
  14. 04 Jan, 2012 1 commit
  15. 07 Dec, 2011 1 commit
    • ceph: use i_ceph_lock instead of i_lock · be655596
      Sage Weil authored
      We have been using i_lock to protect all kinds of data structures in the
      ceph_inode_info struct, including lists of inodes that we need to iterate
      over while avoiding races with inode destruction.  That requires grabbing
      a reference to the inode with the list lock protected, but igrab() now
      takes i_lock to check the inode flags.
      Changing the list lock ordering would be a painful process.
      However, using a ceph-specific i_ceph_lock in the ceph inode instead of
      i_lock is a simple mechanical change and avoids the ordering constraints
      imposed by igrab().
      Reported-by: Amon Ott <a.ott@m-privacy.de>
      Signed-off-by: Sage Weil <sage@newdream.net>
  16. 06 Nov, 2011 1 commit
  17. 03 Nov, 2011 1 commit
  18. 25 Oct, 2011 2 commits
  19. 26 Jul, 2011 6 commits
  20. 21 Jul, 2011 1 commit
    • fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers · 02c24a82
      Josef Bacik authored
      Btrfs needs to be able to control how filemap_write_and_wait_range() is called
      in fsync to make it less of a painful operation, so push down taking i_mutex and
      the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
      file systems can drop taking the i_mutex altogether it seems, like ext3 and
      ocfs2.  For correctness' sake I just pushed everything down in all cases to make
      sure that we keep the current behavior the same for everybody, and then each
      individual fs maintainer can make up their mind about what to do from there.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
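      The shape of a handler after this change can be mocked in userspace
      as below. All names are hypothetical stand-ins: struct mock_inode,
      mock_write_and_wait() and mock_fsync() only imitate the ordering
      (write back first, then take the inode mutex around the
      filesystem-specific work) and do not reproduce any real filesystem's
      handler.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical, heavily reduced inode. */
struct mock_inode {
    pthread_mutex_t i_mutex;   /* models inode->i_mutex */
    int dirty_pages;           /* pages still waiting for writeback */
    int metadata_synced;       /* set once the fs-specific sync has run */
};

/* Stand-in for filemap_write_and_wait_range(): flush all dirty pages. */
static int mock_write_and_wait(struct mock_inode *inode)
{
    inode->dirty_pages = 0;
    return 0;
}

/*
 * Before the change, the caller flushed the page cache and took the mutex
 * before invoking the handler. Afterwards the handler does both itself,
 * so each filesystem can decide whether it needs either step.
 */
int mock_fsync(struct mock_inode *inode)
{
    int ret = mock_write_and_wait(inode);
    if (ret)
        return ret;

    pthread_mutex_lock(&inode->i_mutex);
    inode->metadata_synced = 1;   /* filesystem-specific work goes here */
    pthread_mutex_unlock(&inode->i_mutex);
    return 0;
}
```

      A filesystem that does not need the mutex could drop the lock and
      unlock calls from its own handler without touching the caller,
      which is the flexibility the commit message describes.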
  21. 20 Jul, 2011 1 commit
  22. 11 May, 2011 1 commit
  23. 04 May, 2011 1 commit
  24. 21 Mar, 2011 2 commits