1. 07 Dec, 2017 1 commit
  2. 16 Nov, 2017 1 commit
    • Shakeel Butt's avatar
      fs, mm: account filp cache to kmemcg · f3f7c093
      Shakeel Butt authored
      The allocations from filp cache can be directly triggered by userspace
      applications.  A buggy application can consume a significant amount of
      unaccounted system memory.  Though we have not noticed such buggy
      applications in our production but upon close inspection, we found that
      a lot of machines spend very significant amount of memory on these
      caches.
      
      One way to limit allocations from filp cache is to set system level
      limit of maximum number of open files.  However this limit is shared
      between different users on the system and one user can hog this
      resource.  To cater that, we can charge filp to kmemcg and set the
      maximum limit very high and let the memory limit of each user limit the
      number of files they can open and indirectly limiting their allocations
      from filp cache.
      
      One side effect of this change is that it will allow _sysctl() to return
      ENOMEM and the man page of _sysctl() does not specify that.  However the
      man page also discourages to use _sysctl() at all.
      
      Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.comSigned-off-by: 's avatarShakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3f7c093
  3. 08 Nov, 2017 1 commit
  4. 28 Aug, 2017 1 commit
  5. 06 Jul, 2017 1 commit
    • Jeff Layton's avatar
      fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Jeff Layton authored
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      value.
      
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_error infrastructure.
      This will at least prevent you from getting a false report of success.
      
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
      Signed-off-by: 's avatarJeff Layton <jlayton@redhat.com>
      Reviewed-by: 's avatarJan Kara <jack@suse.cz>
      5660e13d
  6. 02 Mar, 2017 1 commit
  7. 06 Dec, 2016 1 commit
  8. 07 Aug, 2015 1 commit
    • Mel Gorman's avatar
      fs, file table: reinit files_stat.max_files after deferred memory initialisation · 4248b0da
      Mel Gorman authored
      Dave Hansen reported the following;
      
      	My laptop has been behaving strangely with 4.2-rc2.  Once I log
      	in to my X session, I start getting all kinds of strange errors
      	from applications and see this in my dmesg:
      
              	VFS: file-max limit 8192 reached
      
      The problem is that the file-max is calculated before memory is fully
      initialised and miscalculates how much memory the kernel is using.  This
      patch recalculates file-max after deferred memory initialisation.  Note
      that using memory hotplug infrastructure would not have avoided this
      problem as the value is not recalculated after memory hot-add.
      
      4.1:             files_stat.max_files = 6582781
      4.2-rc2:         files_stat.max_files = 8192
      4.2-rc2 patched: files_stat.max_files = 6562467
      
      Small differences with the patch applied and 4.1 but not enough to matter.
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Reported-by: 's avatarDave Hansen <dave.hansen@intel.com>
      Cc: Nicolai Stange <nicstange@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Alex Ng <alexng@microsoft.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      4248b0da
  9. 23 Jun, 2015 1 commit
  10. 12 Apr, 2015 1 commit
  11. 12 Oct, 2014 1 commit
  12. 08 Sep, 2014 1 commit
    • Tejun Heo's avatar
      percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo authored
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Acked-by: 's avatarJan Kara <jack@suse.cz>
      Acked-by: 's avatar"David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  13. 06 Jun, 2014 1 commit
  14. 06 May, 2014 2 commits
    • Al Viro's avatar
      new methods: ->read_iter() and ->write_iter() · 293bc982
      Al Viro authored
      Beginning to introduce those.  Just the callers for now, and it's
      clumsier than it'll eventually become; once we finish converting
      aio_read and aio_write instances, the things will get nicer.
      
      For now, these guys are in parallel to ->aio_read() and ->aio_write();
      they take iocb and iov_iter, with everything in iov_iter already
      validated.  File offset is passed in iocb->ki_pos, iov/nr_segs -
      in iov_iter.
      
      Main concerns in that series are stack footprint and ability to
      split the damn thing cleanly.
      
      [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      293bc982
    • Al Viro's avatar
      replace checking for ->read/->aio_read presence with check in ->f_mode · 7f7f25e8
      Al Viro authored
      Since we are about to introduce new methods (read_iter/write_iter), the
      tests in a bunch of places would have to grow inconveniently.  Check
      once (at open() time) and store results in ->f_mode as FMODE_CAN_READ
      and FMODE_CAN_WRITE resp.  It might end up being a temporary measure -
      once everything switches from ->aio_{read,write} to ->{read,write}_iter
      it might make sense to return to open-coded checks.  We'll see...
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      7f7f25e8
  15. 02 Apr, 2014 4 commits
  16. 31 Mar, 2014 1 commit
  17. 10 Mar, 2014 1 commit
    • Linus Torvalds's avatar
      vfs: atomic f_pos accesses as per POSIX · 9c225f26
      Linus Torvalds authored
      Our write() system call has always been atomic in the sense that you get
      the expected thread-safe contiguous write, but we haven't actually
      guaranteed that concurrent writes are serialized wrt f_pos accesses, so
      threads (or processes) that share a file descriptor and use "write()"
      concurrently would quite likely overwrite each others data.
      
      This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:
      
       "2.9.7 Thread Interactions with Regular File Operations
      
        All of the following functions shall be atomic with respect to each
        other in the effects specified in POSIX.1-2008 when they operate on
        regular files or symbolic links: [...]"
      
      and one of the effects is the file position update.
      
      This unprotected file position behavior is not new behavior, and nobody
      has ever cared.  Until now.  Yongzhi Pan reported unexpected behavior to
      Michael Kerrisk that was due to this.
      
      This resolves the issue with a f_pos-specific lock that is taken by
      read/write/lseek on file descriptors that may be shared across threads
      or processes.
      Reported-by: 's avatarYongzhi Pan <panyongzhi@gmail.com>
      Reported-by: 's avatarMichael Kerrisk <mtk.manpages@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      9c225f26
  18. 09 Nov, 2013 1 commit
  19. 25 Oct, 2013 1 commit
  20. 20 Oct, 2013 1 commit
    • Al Viro's avatar
      nfsd regression since delayed fput() · c7314d74
      Al Viro authored
      Background: nfsd v[23] had throughput regression since delayed fput
      went in; every read or write ends up doing fput() and we get a pair
      of extra context switches out of that (plus quite a bit of work
      in queue_work itselfi, apparently).  Use of schedule_delayed_work()
      gives it a chance to accumulate a bit before we do __fput() on all
      of them.  I'm not too happy about that solution, but... on at least
      one real-world setup it reverts about 10% throughput loss we got from
      switch to delayed fput.
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      c7314d74
  21. 11 Sep, 2013 1 commit
  22. 04 Sep, 2013 1 commit
  23. 13 Jul, 2013 2 commits
  24. 29 Jun, 2013 1 commit
  25. 15 Jun, 2013 1 commit
    • Oleg Nesterov's avatar
      fput: task_work_add() can fail if the caller has passed exit_task_work() · e7b2c406
      Oleg Nesterov authored
      fput() assumes that it can't be called after exit_task_work() but
      this is not true, for example free_ipc_ns()->shm_destroy() can do
      this. In this case fput() silently leaks the file.
      
      Change it to fallback to delayed_fput_work if task_work_add() fails.
      The patch looks complicated but it is not, it changes the code from
      
      	if (PF_KTHREAD) {
      		schedule_work(...);
      		return;
      	}
      	task_work_add(...)
      
      to
      	if (!PF_KTHREAD) {
      		if (!task_work_add(...))
      			return;
      		/* fallback */
      	}
      	schedule_work(...);
      
      As for shm_destroy() in particular, we could make another fix but I
      think this change makes sense anyway. There could be another similar
      user, it is not safe to assume that task_work_add() can't fail.
      Reported-by: 's avatarAndrey Vagin <avagin@openvz.org>
      Signed-off-by: 's avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      e7b2c406
  26. 02 Mar, 2013 1 commit
  27. 23 Feb, 2013 3 commits
  28. 20 Dec, 2012 1 commit
    • Jan Kara's avatar
      fs: Fix imbalance in freeze protection in mark_files_ro() · 72651cac
      Jan Kara authored
      File descriptors (even those for writing) do not hold freeze protection.
      Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
      against remount read-only. Calling mnt_drop_write_file() as we do now
      results in:
      
      [ BUG: bad unlock balance detected! ]
      3.7.0-rc6-00028-g88e75b6 #101 Not tainted
      -------------------------------------
      kworker/1:2/79 is trying to release lock (sb_writers) at:
      [<ffffffff811b33b4>] mnt_drop_write+0x24/0x30
      but there are no more locks to release!
      Reported-by: 's avatarZdenek Kabelac <zkabelac@redhat.com>
      CC: stable@vger.kernel.org
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      72651cac
  29. 10 Oct, 2012 1 commit
  30. 27 Sep, 2012 1 commit
  31. 07 Sep, 2012 1 commit
  32. 31 Jul, 2012 1 commit
  33. 29 Jul, 2012 1 commit