1. 04 Jan, 2019 9 commits
  2. 06 Dec, 2018 2 commits
    • signal: Add restore_user_sigmask() · 854a6ed5
      Deepa Dinamani authored
      Refactor the logic that restores the sigmask before the syscall
      returns into an API.
      This is useful for versions of syscalls that pass in the
      sigmask and expect the current->sigmask to be changed during
      the execution and restored after the execution of the syscall.
      
      With the advent of new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll
      and io_pgetevents) in addition to the existing native and compat
      versions. Adding such an api reduces the logic that would need to
      be replicated otherwise.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • signal: Add set_user_sigmask() · ded653cc
      Deepa Dinamani authored
      Refactor reading a sigset from userspace and updating the sigmask
      into an API.
      
      This is useful for versions of syscalls that pass in the
      sigmask and expect the current->sigmask to be changed during,
      and restored after, the execution of the syscall.
      
      With the advent of new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll,
      and io_pgetevents) in addition to the existing native and compat
      versions. Adding such an api reduces the logic that would need to
      be replicated otherwise.
      
      Note that the existing calls to sigprocmask() ignored its return
      value, as the function only returns an error for an invalid first
      argument, which is hardcoded at these call sites.
      The updated logic uses set_current_blocked() instead.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
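      The save/replace/restore pattern that set_user_sigmask() and
      restore_user_sigmask() factor out can be sketched from userspace.
      This is a minimal analogy using plain sigprocmask(); the kernel
      helpers themselves are not callable from userspace, and
      wait_with_sigmask() here is a hypothetical name for illustration:

      ```c
      #include <signal.h>
      #include <stdio.h>

      /* Sketch of the pattern the kernel helpers factor out: save the
       * current mask, install the caller-supplied one for the duration
       * of the wait, then restore the original afterwards. */
      int wait_with_sigmask(const sigset_t *newmask)
      {
          sigset_t saved;
          if (sigprocmask(SIG_SETMASK, newmask, &saved) != 0)
              return -1;
          /* ... a blocking wait (select/poll/io_getevents) goes here ... */
          return sigprocmask(SIG_SETMASK, &saved, NULL);  /* restore */
      }

      int main(void)
      {
          sigset_t block_all, before, after;
          sigfillset(&block_all);
          sigprocmask(SIG_SETMASK, NULL, &before);  /* snapshot old mask */
          wait_with_sigmask(&block_all);
          sigprocmask(SIG_SETMASK, NULL, &after);   /* snapshot new mask */
          printf("%s\n",
                 sigismember(&after, SIGUSR1) == sigismember(&before, SIGUSR1)
                     ? "restored" : "changed");
          return 0;
      }
      ```

      Note the design point behind pselect()/ppoll(): doing this in
      userspace leaves a window between unmasking and blocking, which is
      why the kernel performs the swap atomically around the syscall body.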
  3. 22 Aug, 2018 7 commits
  4. 28 Jun, 2018 1 commit
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds authored
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 15 Jun, 2018 1 commit
  6. 26 May, 2018 1 commit
  7. 02 Apr, 2018 1 commit
  8. 11 Feb, 2018 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
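      The "almost" can be checked directly. A minimal sketch comparing a
      few of the constant pairs; on x86 and most mainstream architectures
      each comparison holds, while Sparc et al. are the exceptions the
      message warns about:

      ```c
      #include <poll.h>
      #include <stdio.h>
      #include <sys/epoll.h>

      int main(void)
      {
          /* 1 means the EPOLL* constant equals its POLL* counterpart. */
          printf("%d %d %d\n",
                 POLLIN  == EPOLLIN,
                 POLLOUT == EPOLLOUT,
                 POLLERR == EPOLLERR);
          return 0;
      }
      ```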
  9. 01 Feb, 2018 2 commits
  10. 29 Nov, 2017 2 commits
  11. 27 Nov, 2017 2 commits
  12. 18 Nov, 2017 3 commits
    • epoll: remove ep_call_nested() from ep_eventpoll_poll() · 37b5e521
      Jason Baron authored
      The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll
      routine for an epoll fd, is used to prevent excessively deep epoll
      nesting, and to prevent circular paths.
      
      However, we are already preventing these conditions during
      EPOLL_CTL_ADD.  In terms of too deep epoll chains, we do in fact allow
      deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS),
      however we don't allow more than EP_MAX_NESTS when an epoll file
      descriptor is actually connected to a wakeup source.  Thus, we do not
      require the use of ep_call_nested(), since ep_eventpoll_poll(), which is
      called via ep_scan_ready_list(), only continues nesting if there are
      events available.
      
      Since ep_call_nested() is implemented using a global lock, applications
      that make use of nested epoll can see large performance improvements
      with this change.
      
      Davidlohr said:
      
      : Improvements are quite obscene actually, such as for the following
      : epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
      :
      : ncpus  vanilla     dirty       delta
      : 1      2447092     3028315     +23.75%
      : 4      231265      2986954     +1191.57%
      : 8      121631      2898796     +2283.27%
      : 16     59749       2902056     +4757.07%
      : 32     26837       2326314     +8568.30%
      : 64     12926       1341281     +10276.61%
      :
      : (http://linux-scalability.org/epoll/epoll-test.c)
      
      Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
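      The nesting being optimized here can be demonstrated from userspace.
      A minimal Linux sketch of a 2-level nest (error checking omitted for
      brevity): the inner epoll watches a pipe, and the outer epoll watches
      the inner epoll fd itself, so readiness propagates up through
      ep_eventpoll_poll():

      ```c
      #include <stdio.h>
      #include <sys/epoll.h>
      #include <unistd.h>

      int main(void)
      {
          int pipefd[2];
          pipe(pipefd);

          int inner = epoll_create1(0);
          int outer = epoll_create1(0);

          /* inner epoll watches the pipe's read end */
          struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
          epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev);

          /* outer epoll watches the inner epoll fd */
          ev.data.fd = inner;
          epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);

          write(pipefd[1], "x", 1);          /* make the pipe readable */

          struct epoll_event out;
          int n = epoll_wait(outer, &out, 1, 1000);
          printf("%d\n", n);                 /* one ready event bubbles up */
          return 0;
      }
      ```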
    • epoll: avoid calling ep_call_nested() from ep_poll_safewake() · 57a173bd
      Jason Baron authored
      ep_poll_safewake() is used to wakeup potentially nested epoll file
      descriptors.  The function uses ep_call_nested() to prevent entering the
      same wake up queue more than once, and to prevent excessively deep
      wakeup paths (deeper than EP_MAX_NESTS).  However, this is not necessary
      since we are already preventing these conditions during EPOLL_CTL_ADD.
      This saves extra function calls, and avoids taking a global lock during
      the ep_call_nested() calls.
      
      I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
      case, since ep_call_nested() keeps track of the nesting level, and this
      is required by the call to spin_lock_irqsave_nested().  It would be nice
      to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
      case as well, however it's not clear how to simply pass the nesting level
      through multiple wake_up() levels without more surgery.  In any case, I
      don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for production.
      
      This patch also apparently fixes a workload at Google that Salman Qazi
      reported, by completely removing the poll_safewake_ncalls->lock from
      wakeup paths.
      
      Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: account epitem and eppoll_entry to kmemcg · 2ae928a9
      Shakeel Butt authored
      A userspace application can directly trigger the allocations from
      eventpoll_epi and eventpoll_pwq slabs.  A buggy or malicious application
      can consume a significant amount of system memory by triggering such
      allocations.  Indeed we have seen in production where a buggy
      application was leaking the epoll references and causing a burst of
      eventpoll_epi and eventpoll_pwq slab allocations.  This patch opts the
      eventpoll_epi and eventpoll_pwq slabs in to kmemcg charging.
      
      There is a per-user limit (~4% of total memory if no highmem) on these
      caches.  I think it is too generous particularly in the scenario where
      jobs of multiple users are running on the system and the administrator
      is reducing cost by overcommitting the memory.  This is unaccounted
      kernel memory and will not be considered by the oom-killer.  I think by
      accounting it to kmemcg, for systems with kmem accounting enabled, we
      can provide better isolation between jobs of different users.
      
      Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 19 Sep, 2017 1 commit
  14. 09 Sep, 2017 1 commit
  15. 01 Sep, 2017 1 commit
    • epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() · 138e4ad6
      Oleg Nesterov authored
      The race was introduced by me in commit 971316f0 ("epoll:
      ep_unregister_pollwait() can use the freed pwq->whead").  I did not
      realize that nothing can protect eventpoll after ep_poll_callback() sets
      ->whead = NULL, only whead->lock can save us from the race with
      ep_free() or ep_remove().
      
      Move ->whead = NULL to the end of ep_poll_callback() and add the
      necessary barriers.
      
      TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
      before this patch.
      
      Hopefully this explains the use-after-free reported by syzkaller:
      
      	BUG: KASAN: use-after-free in debug_spin_lock_before
      	...
      	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
      	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148
      
      this is spin_lock(eventpoll->lock),
      
      	...
      	Freed by task 17774:
      	...
      	 kfree+0xe8/0x2c0 mm/slub.c:3883
      	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865
      
      Fixes: 971316f0 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
      Reported-by: 范龙飞 <long7573@126.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 12 Jul, 2017 3 commits
  17. 10 Jul, 2017 1 commit
    • fs, epoll: short circuit fetching events if thread has been killed · c257a340
      David Rientjes authored
      We've encountered zombies, waiting for a thread to exit, that loop in
      ep_poll() almost endlessly although there is a pending SIGKILL as a
      result of a group exit.
      
      This happens because we always find ep_events_available() to be true,
      fetch more events, and are never able to check for signal_pending(),
      which would break from the loop and return -EINTR.
      
      Special-case fatal signals and break immediately, to guarantee that we
      do not loop to fetch more events and delay making a timely exit.
      
      It would also be possible to simply move the check for signal_pending()
      higher than checking for ep_events_available(), but there have been no
      reports of delayed signal handling other than SIGKILL preventing zombies
      from exiting that would be fixed by this.
      
      It fixes an issue for us where we have witnessed zombies sticking around
      for at least O(minutes), but considering the code has been like this
      forever and nobody else has complained that I have found, I would simply
      queue it up for 4.12.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
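      The signal-interruption path that ep_poll() relies on can be observed
      from userspace. A minimal sketch using an ordinary (non-fatal) signal
      to interrupt a sleeping epoll_wait(); the commit above adds the
      analogous short-circuit inside the kernel for fatal signals, which a
      handler like this can never demonstrate directly:

      ```c
      #include <errno.h>
      #include <signal.h>
      #include <stdio.h>
      #include <sys/epoll.h>
      #include <unistd.h>

      static void on_alarm(int sig) { (void)sig; }  /* just interrupt the wait */

      int main(void)
      {
          /* Install a handler without SA_RESTART so the syscall is not
           * silently restarted after the signal is handled. */
          struct sigaction sa = {0};
          sa.sa_handler = on_alarm;
          sigemptyset(&sa.sa_mask);
          sigaction(SIGALRM, &sa, NULL);

          int ep = epoll_create1(0);   /* empty set: nothing ever ready */
          alarm(1);                    /* fires while we sleep in ep_poll() */

          struct epoll_event ev;
          int n = epoll_wait(ep, &ev, 1, 5000);
          printf("%d %s\n", n, (n < 0 && errno == EINTR) ? "EINTR" : "other");
          return 0;
      }
      ```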
  18. 20 Jun, 2017 1 commit
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar authored
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... was pretty unclear (to me) in what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>