1. 01 Nov, 2011 1 commit
    • Minchan Kim's avatar
      vmscan: add barrier to prevent evictable page in unevictable list · 21ee9f39
      Minchan Kim authored
      When a race between putback_lru_page() and shmem_lock with lock=0 happens,
      progrom execution order is as follows, but clear_bit in processor #1 could
      be reordered right before spin_unlock of processor #1.  Then, the page
      would be stranded on the unevictable list.
      
      spin_lock
      SetPageLRU
      spin_unlock
                                      clear_bit(AS_UNEVICTABLE)
                                      spin_lock
                                      if PageLRU()
                                              if !test_bit(AS_UNEVICTABLE)
                                              	move evictable list
      smp_mb
      if !test_bit(AS_UNEVICTABLE)
              move evictable list
                                      spin_unlock
      
      But, pagevec_lookup() in scan_mapping_unevictable_pages() has
      rcu_read_[un]lock() so it could protect reordering before reaching
      test_bit(AS_UNEVICTABLE) on processor #1 so this problem never happens.
      But it's a unexpected side effect and we should solve this problem
      properly.
      
      This patch adds a barrier after mapping_clear_unevictable.
      
      I didn't meet this problem but just found during review.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21ee9f39
  2. 04 Aug, 2011 11 commits
    • Hugh Dickins's avatar
      mm: clarify the radix_tree exceptional cases · 8079b1c8
      Hugh Dickins authored
      Make the radix_tree exceptional cases, mostly in filemap.c, clearer.
      
      It's hard to devise a suitable snappy name that illuminates the use by
      shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
      generality.  And akpm points out that /* radix_tree_deref_retry(page) */
      comments look like calls that have been commented out for unknown
      reason.
      
      Skirt the naming difficulty by rearranging these blocks to handle the
      transient radix_tree_deref_retry(page) case first; then just explain the
      remaining shmem/tmpfs swap case in a comment.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8079b1c8
    • Hugh Dickins's avatar
      tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins authored
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e504f3fd
    • Hugh Dickins's avatar
      tmpfs: use kmemdup for short symlinks · 69f07ec9
      Hugh Dickins authored
      But we've not yet removed the old swp_entry_t i_direct[16] from
      shmem_inode_info.  That's because it was still being shared with the
      inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
      size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
      
      I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
      rather than shmem_evict_inode(), where we usually do such freeing? I
      guess it doesn't matter, and I'm not into NUMA mpol testing right now.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarPekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69f07ec9
    • Hugh Dickins's avatar
      tmpfs: convert shmem_writepage and enable swap · 6922c0c7
      Hugh Dickins authored
      Convert shmem_writepage() to use shmem_delete_from_page_cache() to use
      shmem_radix_tree_replace() to substitute swap entry for page pointer
      atomically in the radix tree.
      
      As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
      copying such code from delete_from_swap_cache, but again judged easier
      to sell than making its other callers go through the extras.
      
      Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
      now unreferenced, and the hack to disable swap: it's now good to go.
      
      The way things have worked out, info->lock no longer helps to guard the
      shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
      That global mutex exclusion between shmem_writepage() and shmem_unuse()
      is not pretty, and we ought to find another way; but it's been forced on
      us by recent race discoveries, not a consequence of this patchset.
      
      And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
      swap entry was found already present? That's no longer possible, the
      (unknown) one inserting this page into filecache would hit the swap
      entry occupying that slot.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6922c0c7
    • Hugh Dickins's avatar
      tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895
      Hugh Dickins authored
      Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
      had to move swappage to filecache with GFP_NOWAIT.
      
      Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
      moving its call out from shmem_add_to_page_cache() to two of thats three
      callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
      although asymmetrical, it's easier for all 3 callers to handle.
      
      These two changes would also be appropriate if anyone were to start
      using shmem_read_mapping_page_gfp() with GFP_NOWAIT.
      
      Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
      radix_tree_exceptional_entry() to get what it needs for itself.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aa3b1895
    • Hugh Dickins's avatar
      tmpfs: convert shmem_getpage_gfp to radix-swap · 54af6042
      Hugh Dickins authored
      Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
      swap entry returned from radix tree by find_lock_page().
      
      Whereas the repetitive old method proceeded mainly under info->lock,
      dropping and repeating whenever one of the conditions needed was not
      met, now we can proceed without it, leaving shmem_add_to_page_cache() to
      check for a race.
      
      This way there is no need to preallocate a page, no need for an early
      radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().
      
      Move the error unwinding down to the bottom instead of repeating it
      throughout.  ENOSPC handling is a little different from before: there is
      no longer any race between find_lock_page() and finding swap, but we can
      arrive at ENOSPC before calling shmem_recalc_inode(), which might
      occasionally discover freed space.
      
      Be stricter to check i_size before returning.  info->lock is used for
      little but alloced, swapped, i_blocks updates.  Move i_blocks updates
      out from under the max_blocks check, so even an unlimited size=0 mount
      can show accurate du.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54af6042
    • Hugh Dickins's avatar
      tmpfs: convert shmem_unuse_inode to radix-swap · 46f65ec1
      Hugh Dickins authored
      Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
      tree, searching for matching swap.
      
      This is somewhat slower than the old method: because of repeated radix
      tree descents, because of copying entries up, but probably most because
      the old method noted and skipped once a vector page was cleared of swap.
      Perhaps we can devise a use of radix tree tagging to achieve that later.
      
      shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
      for the lockless lookup by checking that the expected entry is in place,
      under lock.  It is not very satisfactory to be copying this much from
      add_to_page_cache_locked(), but I think easier to sell than insisting
      that every caller of add_to_page_cache*() go through the extras.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46f65ec1
    • Hugh Dickins's avatar
      tmpfs: convert shmem_truncate_range to radix-swap · 7a5d0fbb
      Hugh Dickins authored
      Disable the toy swapping implementation in shmem_writepage() - it's hard
      to support two schemes at once - and convert shmem_truncate_range() to a
      lockless gang lookup of swap entries along with pages, freeing both.
      
      Since the second loop tightens its noose until all entries of either
      kind have been squeezed out (and we shall make sure that there's not an
      instant when neither is visible), there is no longer a need for yet
      another pass below.
      
      shmem_radix_tree_replace() compensates for the lockless lookup by
      checking that the expected entry is in place, under lock, before
      replacing it.  Here it just deletes, but will be used in later patches
      to substitute swap entry for page or page for swap entry.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7a5d0fbb
    • Hugh Dickins's avatar
      tmpfs: copy truncate_inode_pages_range · bda97eab
      Hugh Dickins authored
      Bring truncate.c's code for truncate_inode_pages_range() inline into
      shmem_truncate_range(), replacing its first call (there's a followup
      call below, but leave that one, it will disappear next).
      
      Don't play with it yet, apart from leaving out the cleancache flush, and
      (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
      partial page preparation into its partial page handling.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bda97eab
    • Hugh Dickins's avatar
      tmpfs: miscellaneous trivial cleanups · 41ffe5d5
      Hugh Dickins authored
      While it's at its least, make a number of boring nitpicky cleanups to
      shmem.c, mostly for consistency of variable naming.  Things like "swap"
      instead of "entry", "pgoff_t index" instead of "unsigned long idx".
      
      And since everything else here is prefixed "shmem_", better change
      init_tmpfs() to shmem_init().
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      41ffe5d5
    • Hugh Dickins's avatar
      tmpfs: demolish old swap vector support · 285b2c4f
      Hugh Dickins authored
      The maximum size of a shmem/tmpfs file has been limited by the maximum
      size of its triple-indirect swap vector.  With 4kB page size, maximum
      filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
      that on a 64-bit kernel.  (With 8kB page size, maximum filesize was just
      over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
      MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)
      
      It's a shame that tmpfs should be more restrictive than ramfs, and this
      limitation has now been noticed.  Add another level to the swap vector?
      No, it became obscure and hard to maintain, once I complicated it to
      make use of highmem pages nine years ago: better choose another way.
      
      Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
      tmpfs would never have invented its own peculiar radix tree: we would
      have fitted swap entries into the common radix tree instead, in much the
      same way as we fit swap entries into page tables.
      
      And why should each file have a separate radix tree for its pages and
      for its swap entries? The swap entries are required precisely where and
      when the pages are not.  We want to put them together in a single radix
      tree: which can then avoid much of the locking which was needed to
      prevent them from being exchanged underneath us.
      
      This also avoids the waste of memory devoted to swap vectors, first in
      the shmem_inode itself, then at least two more pages once a file grew
      beyond 16 data pages (pages accounted by df and du, but not by memcg).
      Allocated upfront, to avoid allocation when under swapping pressure, but
      pure waste when CONFIG_SWAP is not set - I have never spattered around
      the ifdefs to prevent that, preferring this move to sharing the common
      radix tree instead.
      
      There are three downsides to sharing the radix tree.  One, that it binds
      tmpfs more tightly to the rest of mm, either requiring knowledge of swap
      entries in radix tree there, or duplication of its code here in shmem.c.
      I believe that the simplications and memory savings (and probable higher
      performance, not yet measured) justify that.
      
      Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
      nodes that cannot be freed under memory pressure - whereas before it was
      the less precious highmem swap vector pages that could not be freed.
      I'm hoping that 64-bit has now been accessible for long enough, that the
      highmem argument has grown much less persuasive.
      
      Three, that swapoff is slower than it used to be on tmpfs files, since
      it's using a simple generic mechanism not tailored to it: I find this
      noticeable, and shall want to improve, but maybe nobody else will
      notice.
      
      So...  now remove most of the old swap vector code from shmem.c.  But,
      for the moment, keep the simple i_direct vector of 16 pages, with simple
      accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
      to help mark where swap needs to be handled in subsequent patches.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      285b2c4f
  3. 26 Jul, 2011 8 commits
    • Hugh Dickins's avatar
      tmpfs: simplify unuse and writepage · 48f170fb
      Hugh Dickins authored
      shmem_unuse_inode() and shmem_writepage() contain a little code to cope
      with pages inserted independently into the filecache, probably by a
      filesystem stacked on top of tmpfs, then fed to its ->readpage() or
      ->writepage().
      
      Unionfs was indeed experimenting with working in that way three years ago,
      but I find no current examples: nowadays the stacking filesystems use vfs
      interfaces to the lower filesystem.
      
      It's now illegal: remove most of that code, adding some WARN_ON_ONCEs.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Erez Zadok <ezk@fsl.cs.sunysb.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48f170fb
    • Hugh Dickins's avatar
      tmpfs: simplify filepage/swappage · 27ab7006
      Hugh Dickins authored
      We can now simplify shmem_getpage_gfp(): there is no longer a dilemma of
      filepage passed in via shmem_readpage(), then swappage found, which must
      then be copied over to it.
      
      Although at first it's tempting to replace the **pagep arg by returning
      struct page *, that makes a mess of IS_ERR_OR_NULL(page)s in all the
      callers, so leave as is.
      
      Insert BUG_ON(!PageUptodate) when we find and lock page: some of the
      complication came from uninitialized pages inserted into filecache prior
      to readpage; but now we're in control, and only release pagelock on
      filecache once it's uptodate (if an error occurs in reading back from
      swap, the page remains in swapcache, never moved to filecache).
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27ab7006
    • Hugh Dickins's avatar
      tmpfs: simplify prealloc_page · e83c32e8
      Hugh Dickins authored
      The prealloc_page handling in shmem_getpage_gfp() is unnecessarily
      complicated: first simplify that before going on to filepage/swappage.
      
      That's right, don't report ENOMEM when the preallocation fails: we may or
      may not need the page.  But simply report ENOMEM once we find we do need
      it, instead of dropping lock, repeating allocation, unwinding on failure
      etc.  And leave the out label on the fast path, don't goto.
      
      Fix something that looks like a bug but turns out not to be: set
      PageSwapBacked on prealloc_page before its mem_cgroup_cache_charge(), as
      the removed case was doing.  That's important before adding to LRU
      (determines which LRU the page goes on), and does affect which path it
      takes through memcontrol.c, but in the end MEM_CGROUP_CHANGE_TYPE_ SHMEM
      is handled no differently from CACHE.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarShaohua Li <shaohua.li@intel.com>
      Cc: "Zhang, Yanmin" <yanmin.zhang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e83c32e8
    • Hugh Dickins's avatar
      tmpfs: remove_shmem_readpage · 9276aad6
      Hugh Dickins authored
      Remove that pernicious shmem_readpage() at last: the things we needed it
      for (splice, loop, sendfile, i915 GEM) are now fully taken care of by
      shmem_file_splice_read() and shmem_read_mapping_page_gfp().
      
      This removal clears the way for a simpler shmem_getpage_gfp(), since page
      is never passed in; but leave most of that cleanup until after.
      
      sys_readahead() and sys_fadvise(POSIX_FADV_WILLNEED) will now EINVAL,
      instead of unexpectedly trying to read ahead on tmpfs: if that proves to
      be an issue for someone, then we can either arrange for them to return
      success instead, or try to implement async readahead on tmpfs.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9276aad6
    • Hugh Dickins's avatar
      tmpfs: pass gfp to shmem_getpage_gfp · 68da9f05
      Hugh Dickins authored
      Make shmem_getpage() a wrapper, passing mapping_gfp_mask() down to
      shmem_getpage_gfp(), which in turn passes gfp down to shmem_swp_alloc().
      
      Change shmem_read_mapping_page_gfp() to use shmem_getpage_gfp() in the
      CONFIG_SHMEM case; but leave tiny !SHMEM using read_cache_page_gfp().
      
      Add a BUG_ON() in case anyone happens to call this on a non-shmem mapping;
      though we might later want to let that case route to read_cache_page_gfp().
      
      It annoys me to have these two almost-redundant args, gfp and fault_type:
      I can't find a better way; but initialize fault_type only in shmem_fault().
      
      Note that before, read_cache_page_gfp() was allocating i915_gem's pages
      with __GFP_NORETRY as intended; but the corresponding swap vector pages
      got allocated without it, leaving a small possibility of OOM.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68da9f05
    • Hugh Dickins's avatar
      tmpfs: refine shmem_file_splice_read · 71f0e07a
      Hugh Dickins authored
      Tidy up shmem_file_splice_read():
      
      Remove readahead: okay, we could implement shmem readahead on swap,
      but have never done so before, swap being the slow exceptional path.
      
      Use shmem_getpage() instead of find_or_create_page() plus ->readpage().
      
      Remove several comments: sorry, I found them more distracting than
      helpful, and this will not be the reference version of splice_read().
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71f0e07a
    • Hugh Dickins's avatar
      tmpfs: clone shmem_file_splice_read() · 708e3508
      Hugh Dickins authored
      Copy __generic_file_splice_read() and generic_file_splice_read() from
      fs/splice.c to shmem_file_splice_read() in mm/shmem.c.  Make
      page_cache_pipe_buf_ops and spd_release_page() accessible to it.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      708e3508
    • Hugh Dickins's avatar
      tmpfs: no need to use i_lock · d515afe8
      Hugh Dickins authored
      2.6.36's 7e496299 ("tmpfs: make tmpfs scalable with percpu_counter for
      used blocks") to make tmpfs scalable with percpu_counter used
      inode->i_lock in place of sbinfo->stat_lock around i_blocks updates; but
      that was adverse to scalability, and unnecessary, since info->lock is
      already held there in the fast paths.
      
      Remove those uses of i_lock, and add info->lock in the three error paths
      where it's then needed across shmem_free_blocks().  It's not actually
      needed across shmem_unacct_blocks(), but they're so often paired that it
      looks wrong to split them apart.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d515afe8
  4. 25 Jul, 2011 1 commit
  5. 18 Jul, 2011 1 commit
    • Mimi Zohar's avatar
      security: new security_inode_init_security API adds function callback · 9d8f13ba
      Mimi Zohar authored
      This patch changes the security_inode_init_security API by adding a
      filesystem specific callback to write security extended attributes.
      This change is in preparation for supporting the initialization of
      multiple LSM xattrs and the EVM xattr.  Initially the callback function
      walks an array of xattrs, writing each xattr separately, but could be
      optimized to write multiple xattrs at once.
      
      For existing security_inode_init_security() calls, which have not yet
      been converted to use the new callback function, such as those in
      reiserfs and ocfs2, this patch defines security_old_inode_init_security().
      Signed-off-by: default avatarMimi Zohar <zohar@us.ibm.com>
      9d8f13ba
  6. 28 Jun, 2011 2 commits
    • Hugh Dickins's avatar
      tmpfs: add shmem_read_mapping_page_gfp · d9d90e5e
      Hugh Dickins authored
      Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
      is unsuited to tmpfs, because it inserts a page into pagecache before
      calling the filesystem's ->readpage: tmpfs may have pages in swapcache
      which only it knows how to locate and switch to filecache.
      
      At present tmpfs provides a ->readpage method, and copes with this by
      copying pages; but soon we can simplify it by removing its ->readpage.
      Provide shmem_read_mapping_page_gfp() now, ready for that transition,
      
      Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h,
      with shmem_read_mapping_page() inline for the common mapping_gfp case.
      
      (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
      read_mapping_page functions use the mapping's ->readpage, and the
      read_cache_page functions use the supplied filler, so I think
      read_cache_page_gfp was slightly misnamed.)
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9d90e5e
    • Hugh Dickins's avatar
      tmpfs: take control of its truncate_range · 94c1e62d
      Hugh Dickins authored
      2.6.35's new truncate convention gave tmpfs the opportunity to control
      its file truncation, no longer enforced from outside by vmtruncate().
      We shall want to build upon that, to handle pagecache and swap together.
      
      Slightly redefine the ->truncate_range interface: let it now be called
      between the unmap_mapping_range()s, with the filesystem responsible for
      doing the truncate_inode_pages_range() from it - just as the filesystem
      is nowadays responsible for doing that from its ->setattr.
      
      Let's rename shmem_notify_change() to shmem_setattr().  Instead of
      calling the generic truncate_setsize(), bring that code in so we can
      call shmem_truncate_range() - which will later be updated to perform its
      own variant of truncate_inode_pages_range().
      
      Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
      now that the COW's unmap_mapping_range() comes after ->truncate_range,
      there is no need to call it a third time.
      
      Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
      that i915_gem_object_truncate() can call it explicitly in future; get
      this patch in first, then update drm/i915 once this is available (until
      then, i915 will just be doing the truncate_inode_pages() twice).
      
      Though introduced five years ago, no other filesystem is implementing
      ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
      expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
      whereupon ->truncate_range can be removed from inode_operations -
      shmem_truncate_range() will help i915 across that transition too.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94c1e62d
  7. 28 May, 2011 1 commit
    • Hugh Dickins's avatar
      tmpfs: fix race between truncate and writepage · 826267cf
      Hugh Dickins authored
      While running fsx on tmpfs with a memhog then swapoff, swapoff was hanging
      (interruptibly), repeatedly failing to locate the owner of a 0xff entry in
      the swap_map.
      
      Although shmem_writepage() does abandon when it sees incoming page index
      is beyond eof, there was still a window in which shmem_truncate_range()
      could come in between writepage's dropping lock and updating swap_map,
      find the half-completed swap_map entry, and in trying to free it,
      leave it in a state that swap_shmem_alloc() could not correct.
      
      Arguably a bug in __swap_duplicate()'s and swap_entry_free()'s handling
      of the different cases, but easiest to fix by moving swap_shmem_alloc()
      under cover of the lock.
      
      More interesting than the bug: it's been there since 2.6.33, why could
      I not see it with earlier kernels?  The mmotm of two weeks ago seems to
      have some magic for generating races, this is just one of three I found.
      
      With yesterday's git I first saw this in mainline, bisected in search of
      that magic, but the easy reproducibility evaporated.  Oh well, fix the bug.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      826267cf
  8. 27 May, 2011 1 commit
    • Ying Han's avatar
      memcg: add the pagefault count into memcg stats · 456f998e
      Ying Han authored
      Two new stats in per-memcg memory.stat which tracks the number of page
      faults and number of major page faults.
      
        "pgfault"
        "pgmajfault"
      
      They are different from "pgpgin"/"pgpgout" stat which count number of
      pages charged/discharged to the cgroup and have no meaning of reading/
      writing page to disk.
      
      It is valuable to track the two stats for both measuring application's
      performance as well as the efficiency of the kernel page reclaim path.
      Counting pagefaults per process is useful, but we also need the aggregated
      value since processes are monitored and controlled in cgroup basis in
      memcg.
      
      Functional test: check the total number of pgfault/pgmajfault of all
      memcgs and compare with global vmstat value:
      
        $ cat /proc/vmstat | grep fault
        pgfault 1070751
        pgmajfault 553
      
        $ cat /dev/cgroup/memory.stat | grep fault
        pgfault 1071138
        pgmajfault 553
        total_pgfault 1071142
        total_pgmajfault 553
      
        $ cat /dev/cgroup/A/memory.stat | grep fault
        pgfault 199
        pgmajfault 0
        total_pgfault 199
        total_pgmajfault 0
      
      Performance test: run page fault test(pft) wit 16 thread on faulting in
      15G anon pages in 16G container.  There is no regression noticed on the
      "flt/cpu/s"
      
      Sample output from pft:
      
        TAG pft:anon-sys-default:
          Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
          15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
      
        +-------------------------------------------------------------------------+
            N           Min           Max        Median           Avg        Stddev
        x  10     16682.962     17344.027     16913.524     16928.812      166.5362
        +  10     16695.568     16923.896     16820.604     16824.652     84.816568
        No difference proven at 95.0% confidence
      
      [akpm@linux-foundation.org: fix build]
      [hughd@google.com: shmem fix]
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      456f998e
  9. 25 May, 2011 1 commit
    • Eric Paris's avatar
      tmpfs: implement generic xattr support · b09e0fa4
      Eric Paris authored
      Implement generic xattrs for tmpfs filesystems.  The Feodra project, while
      trying to replace suid apps with file capabilities, realized that tmpfs,
      which is used on the build systems, does not support file capabilities and
      thus cannot be used to build packages which use file capabilities.  Xattrs
      are also needed for overlayfs.
      
      The xattr interface is a bit odd.  If a filesystem does not implement any
      {get,set,list}xattr functions the VFS will call into some random LSM hooks
      and the running LSM can then implement some method for handling xattrs.
      SELinux for example provides a method to support security.selinux but no
      other security.* xattrs.
      
      As it stands today when one enables CONFIG_TMPFS_POSIX_ACL tmpfs will have
      xattr handler routines specifically to handle acls.  Because of this tmpfs
      would loose the VFS/LSM helpers to support the running LSM.  To make up
      for that tmpfs had stub functions that did nothing but call into the LSM
      hooks which implement the helpers.
      
      This new patch does not use the LSM fallback functions and instead just
      implements a native get/set/list xattr feature for the full security.* and
      trusted.* namespace like a normal filesystem.  This means that tmpfs can
      now support both security.selinux and security.capability, which was not
      previously possible.
      
      The basic implementation is that I attach a:
      
      struct shmem_xattr {
      	struct list_head list; /* anchored by shmem_inode_info->xattr_list */
      	char *name;
      	size_t size;
      	char value[0];
      };
      
      Into the struct shmem_inode_info for each xattr that is set.  This
      implementation could easily support the user.* namespace as well, except
      some care needs to be taken to prevent large amounts of unswappable memory
      being allocated for unprivileged users.
      
      [mszeredi@suse.cz: new config option, suport trusted.*, support symlinks]
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@ubuntu.com>
      Tested-by: default avatarSerge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarJordi Pujol <jordipujolp@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b09e0fa4
  10. 20 May, 2011 1 commit
  11. 14 May, 2011 1 commit
    • Hugh Dickins's avatar
      tmpfs: fix race between swapoff and writepage · 05bf86b4
      Hugh Dickins authored
      Shame on me!  Commit b1dea800 "tmpfs: fix race between umount and
      writepage" fixed the advertized race, but introduced another: as even
      its comment makes clear, we cannot safely rely on a peek at list_empty()
      while holding no lock - until info->swapped is set, shmem_unuse_inode()
      may delete any formerly-swapped inode from the shmem_swaplist, which
      in this case would leave a swap area impossible to swapoff.
      
      Although I don't relish taking the mutex every time, I don't care much
      for the alternatives either; and at least the peek at list_empty() in
      shmem_evict_inode() (a hotter path since most inodes would never have
      been swapped) remains safe, because we already truncated the whole file.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      05bf86b4
  12. 12 May, 2011 3 commits
    • Hugh Dickins's avatar
      tmpfs: fix spurious ENOSPC when racing with unswap · 59a16ead
      Hugh Dickins authored
      Testing the shmem_swaplist replacements for igrab() revealed another bug:
      writes to /dev/loop0 on a tmpfs file which fills its filesystem were
      sometimes failing with "Buffer I/O error"s.
      
      These came from ENOSPC failures of shmem_getpage(), when racing with
      swapoff: the same could happen when racing with another shmem_getpage(),
      pulling the page in from swap in between our find_lock_page() and our
      taking the info->lock (though not in the single-threaded loop case).
      
      This is unacceptable, and surprising that I've not noticed it before:
      it dates back many years, but (presumably) was made a lot easier to
      reproduce in 2.6.36, which sited a page preallocation in the race window.
      
      Fix it by rechecking the page cache before settling on an ENOSPC error.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59a16ead
    • Hugh Dickins's avatar
      tmpfs: fix race between umount and swapoff · 778dd893
      Hugh Dickins authored
      The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
      to umount as that in shmem_writepage().
      
      Fix this instance by extending the protection of shmem_swaplist_mutex
      right across shmem_unuse_inode(): while it's on the list, the inode cannot
      be evicted (and the filesystem cannot be unmounted) without
      shmem_evict_inode() taking that mutex to remove it from the list.
      
      But since shmem_writepage() might take that mutex, we should avoid making
      memory allocations or memcg charges while holding it: prepare them at the
      outer level in shmem_unuse().  When mem_cgroup_cache_charge() was
      originally placed, we didn't know until that point that the page from swap
      was actually a shmem page; but nowadays it's noted in the swap_map, so
      we're safe to charge upfront.  For the radix_tree, do as is done in
      shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a
      habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves
      on occasion if subsequently preempted.
      
      With the allocation and charge moved out from shmem_unuse_inode(),
      we can also hold index map and info->lock over from finding the entry.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      778dd893
    • Hugh Dickins's avatar
      tmpfs: fix race between umount and writepage · b1dea800
      Hugh Dickins authored
      Konstanin Khlebnikov reports that a dangerous race between umount and
      shmem_writepage can be reproduced by this script:
      
        for i in {1..300} ; do
      	mkdir $i
      	while true ; do
      		mount -t tmpfs none $i
      		dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
      		umount $i
      	done &
        done
      
      on a 6xCPU node with 8Gb RAM: kernel very unstable after this accident. =)
      
      Kernel log:
      
        VFS: Busy inodes after unmount of tmpfs.
                       Self-destruct in 5 seconds.  Have a nice day...
      
        WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
        Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
        Call Trace:
         warn_slowpath_common+0x80/0x98
         warn_slowpath_fmt+0x41/0x43
         __list_del_entry+0x8d/0x98
         evict+0x50/0x113
         iput+0x138/0x141
        ...
        BUG: unable to handle kernel paging request at ffffffffffffffff
        IP: shmem_free_blocks+0x18/0x4c
        Pid: 10422, comm: dd Tainted: G        W   2.6.39-rc2+ #4
        Call Trace:
         shmem_recalc_inode+0x61/0x66
         shmem_writepage+0xba/0x1dc
         pageout+0x13c/0x24c
         shrink_page_list+0x28e/0x4be
         shrink_inactive_list+0x21f/0x382
        ...
      
      shmem_writepage() calls igrab() on the inode for the page which came from
      page reclaim, to add it later into shmem_swaplist for swapoff operation.
      
      This igrab() can race with super-block deactivating process:
      
        shrink_inactive_list()          deactivate_super()
        pageout()                       tmpfs_fs_type->kill_sb()
        shmem_writepage()               kill_litter_super()
                                        generic_shutdown_super()
                                         evict_inodes()
         igrab()
                                          atomic_read(&inode->i_count)
                                           skip-inode
         iput()
                                         if (!list_empty(&sb->s_inodes))
                                                printk("VFS: Busy inodes after...
      
      This igrap-iput pair was added in commit 1b1b32f2 "tmpfs: fix
      shmem_swaplist races" based on incorrect assumptions: igrab() protects the
      inode from concurrent eviction by deletion, but it does nothing to protect
      it from concurrent unmounting, which goes ahead despite the raised
      i_count.
      
      So this use of igrab() was wrong all along, but the race made much worse
      in 2.6.37 when commit 63997e98 "split invalidate_inodes()" replaced
      two attempts at invalidate_inodes() by a single evict_inodes().
      
      Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure
      whether it was correct or not; but burnt once by igrab(), I am sure that
      we don't want to rely more deeply upon externals here.
      
      Fix it by adding the inode to shmem_swaplist earlier, while the page lock
      on page in page cache still secures the inode against eviction, without
      artifically raising i_count.  It was originally added later because
      shmem_unuse_inode() is liable to remove an inode from the list while it's
      unswapped; but we can guard against that by taking spinlock before
      dropping mutex.
      Reported-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b1dea800
  13. 14 Apr, 2011 1 commit
  14. 23 Mar, 2011 2 commits
  15. 14 Mar, 2011 1 commit
  16. 10 Mar, 2011 1 commit
  17. 01 Mar, 2011 1 commit
  18. 01 Feb, 2011 1 commit
    • Eric Paris's avatar
      fs/vfs/security: pass last path component to LSM on inode creation · 2a7dba39
      Eric Paris authored
      SELinux would like to implement a new labeling behavior of newly created
      inodes.  We currently label new inodes based on the parent and the creating
      process.  This new behavior would also take into account the name of the
      new object when deciding the new label.  This is not the (supposed) full path,
      just the last component of the path.
      
      This is very useful because creating /etc/shadow is different than creating
      /etc/passwd but the kernel hooks are unable to differentiate these
      operations.  We currently require that userspace realize it is doing some
      difficult operation like that and than userspace jumps through SELinux hoops
      to get things set up correctly.  This patch does not implement new
      behavior, that is obviously contained in a seperate SELinux patch, but it
      does pass the needed name down to the correct LSM hook.  If no such name
      exists it is fine to pass NULL.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      2a7dba39
  19. 07 Jan, 2011 1 commit
    • Nick Piggin's avatar
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin authored
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
      fa0d7e3d