1. 28 Jun, 2018 1 commit
    • Linus Torvalds's avatar
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds authored
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  2. 15 Jun, 2018 1 commit
  3. 26 May, 2018 1 commit
  4. 02 Apr, 2018 1 commit
  5. 11 Feb, 2018 1 commit
    • Linus Torvalds's avatar
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  6. 06 Jan, 2018 3 commits
  7. 27 Nov, 2017 1 commit
  8. 20 Jun, 2017 1 commit
    • Ingo Molnar's avatar
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar authored
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: 's avatarIngo Molnar <mingo@kernel.org>
      ac6424b9
  9. 16 May, 2017 1 commit
  10. 02 Mar, 2017 1 commit
  11. 22 Mar, 2016 1 commit
    • Paolo Bonzini's avatar
      eventfd: document lockless access in eventfd_poll · a484c3dd
      Paolo Bonzini authored
      Since commit e22553e2 ("eventfd: don't take the spinlock in
      eventfd_poll", 2015-02-17), eventfd is reading ctx->count outside
      ctx->wqh.lock.
      
      However, things aren't as simple as the read barrier in eventfd_poll
      would suggest.  In fact, the read barrier, besides lacking a comment, is
      not paired in any obvious manner with another read barrier, and it is
      pointless because it is sitting between a write (deep in poll_wait) and
      the read of ctx->count.  The read barrier is acting just as a compiler
      barrier, for which we can use READ_ONCE instead.  This is what the code
      change in this patch does.
      
      The documentation change is just as important, however.  The question,
      posed by Andrea Arcangeli, is then why the thing is safe on
      architectures where spin_unlock does not imply a store-load memory
      barrier.  The answer is that it's safe because writes of ctx->count use
      the same lock as poll_wait, and hence an acquire barrier implicit in
      poll_wait provides the necessary synchronization between eventfd_poll
      and callers of wake_up_locked_poll.  This is sort of mentioned in the
      commit message with respect to eventfd_ctx_read ("eventfd_read is
      similar, it will do a single decrement with the lock held") but it
      applies to all other callers too.  It's tricky enough that it should be
      documented in the code.
      Signed-off-by: 's avatarPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: 's avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      a484c3dd
  12. 08 Dec, 2015 1 commit
  13. 17 Feb, 2015 1 commit
    • Chris Mason's avatar
      eventfd: don't take the spinlock in eventfd_poll · e22553e2
      Chris Mason authored
      The spinlock in eventfd_poll is trying to protect the count of events so
      it can decide if it should return POLLIN, POLLERR, or POLLOUT.  But,
      because of the way we drop the lock after calling poll_wait, and drop it
      again before returning, we have the same pile of races with the lock as
      we do with a single read of ctx->count().
      
      This replaces the lock with a read barrier and single read.
      
      eventfd_write does a single bump of ctx->count, so this should not add
      new races with adding events.  eventfd_read is similar, it will do a
      single decrement with the lock held, and so we're making the race with
      concurrent readers slightly larger.
      
      This spinlock is the top CPU user in kernel code during one of our
      workloads.  Removing it gives us a ~2% boost.
      
      [arnd@arndb.de: avoid unused variable warning]
      [dan.carpenter@oracle.com: type bug in eventfd_poll()]
      Signed-off-by: 's avatarChris Mason <clm@fb.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: 's avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      e22553e2
  14. 05 Nov, 2014 1 commit
  15. 25 Jan, 2014 1 commit
  16. 18 Dec, 2012 1 commit
  17. 01 Jun, 2012 1 commit
  18. 29 Feb, 2012 1 commit
  19. 21 Feb, 2011 1 commit
  20. 15 Oct, 2010 1 commit
    • Arnd Bergmann's avatar
      llseek: automatically add .llseek fop · 6038f373
      Arnd Bergmann authored
      All file_operations should get a .llseek operation so we can make
      nonseekable_open the default for future file operations without a
      .llseek pointer.
      
      The three cases that we can automatically detect are no_llseek, seq_lseek
      and default_llseek. For cases where we can we can automatically prove that
      the file offset is always ignored, we use noop_llseek, which maintains
      the current behavior of not returning an error from a seek.
      
      New drivers should normally not use noop_llseek but instead use no_llseek
      and call nonseekable_open at open time.  Existing drivers can be converted
      to do the same when the maintainer knows for certain that no user code
      relies on calling seek on the device file.
      
      The generated code is often incorrectly indented and right now contains
      comments that clarify for each added line why a specific variant was
      chosen. In the version that gets submitted upstream, the comments will
      be gone and I will manually fix the indentation, because there does not
      seem to be a way to do that using coccinelle.
      
      Some amount of new code is currently sitting in linux-next that should get
      the same modifications, which I will do at the end of the merge window.
      
      Many thanks to Julia Lawall for helping me learn to write a semantic
      patch that does all this.
      
      ===== begin semantic patch =====
      // This adds an llseek= method to all file operations,
      // as a preparation for making no_llseek the default.
      //
      // The rules are
      // - use no_llseek explicitly if we do nonseekable_open
      // - use seq_lseek for sequential files
      // - use default_llseek if we know we access f_pos
      // - use noop_llseek if we know we don't access f_pos,
      //   but we still want to allow users to call lseek
      //
      @ open1 exists @
      identifier nested_open;
      @@
      nested_open(...)
      {
      <+...
      nonseekable_open(...)
      ...+>
      }
      
      @ open exists@
      identifier open_f;
      identifier i, f;
      identifier open1.nested_open;
      @@
      int open_f(struct inode *i, struct file *f)
      {
      <+...
      (
      nonseekable_open(...)
      |
      nested_open(...)
      )
      ...+>
      }
      
      @ read disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      <+...
      (
         *off = E
      |
         *off += E
      |
         func(..., off, ...)
      |
         E = *off
      )
      ...+>
      }
      
      @ read_no_fpos disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ write @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      <+...
      (
        *off = E
      |
        *off += E
      |
        func(..., off, ...)
      |
        E = *off
      )
      ...+>
      }
      
      @ write_no_fpos @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ fops0 @
      identifier fops;
      @@
      struct file_operations fops = {
       ...
      };
      
      @ has_llseek depends on fops0 @
      identifier fops0.fops;
      identifier llseek_f;
      @@
      struct file_operations fops = {
      ...
       .llseek = llseek_f,
      ...
      };
      
      @ has_read depends on fops0 @
      identifier fops0.fops;
      identifier read_f;
      @@
      struct file_operations fops = {
      ...
       .read = read_f,
      ...
      };
      
      @ has_write depends on fops0 @
      identifier fops0.fops;
      identifier write_f;
      @@
      struct file_operations fops = {
      ...
       .write = write_f,
      ...
      };
      
      @ has_open depends on fops0 @
      identifier fops0.fops;
      identifier open_f;
      @@
      struct file_operations fops = {
      ...
       .open = open_f,
      ...
      };
      
      // use no_llseek if we call nonseekable_open
      ////////////////////////////////////////////
      @ nonseekable1 depends on !has_llseek && has_open @
      identifier fops0.fops;
      identifier nso ~= "nonseekable_open";
      @@
      struct file_operations fops = {
      ...  .open = nso, ...
      +.llseek = no_llseek, /* nonseekable */
      };
      
      @ nonseekable2 depends on !has_llseek @
      identifier fops0.fops;
      identifier open.open_f;
      @@
      struct file_operations fops = {
      ...  .open = open_f, ...
      +.llseek = no_llseek, /* open uses nonseekable */
      };
      
      // use seq_lseek for sequential files
      /////////////////////////////////////
      @ seq depends on !has_llseek @
      identifier fops0.fops;
      identifier sr ~= "seq_read";
      @@
      struct file_operations fops = {
      ...  .read = sr, ...
      +.llseek = seq_lseek, /* we have seq_read */
      };
      
      // use default_llseek if there is a readdir
      ///////////////////////////////////////////
      @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier readdir_e;
      @@
      // any other fop is used that changes pos
      struct file_operations fops = {
      ... .readdir = readdir_e, ...
      +.llseek = default_llseek, /* readdir is present */
      };
      
      // use default_llseek if at least one of read/write touches f_pos
      /////////////////////////////////////////////////////////////////
      @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read.read_f;
      @@
      // read fops use offset
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = default_llseek, /* read accesses f_pos */
      };
      
      @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ... .write = write_f, ...
      +	.llseek = default_llseek, /* write accesses f_pos */
      };
      
      // Use noop_llseek if neither read nor write accesses f_pos
      ///////////////////////////////////////////////////////////
      
      @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      identifier write_no_fpos.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ...
       .write = write_f,
       .read = read_f,
      ...
      +.llseek = noop_llseek, /* read and write both use no f_pos */
      };
      
      @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write_no_fpos.write_f;
      @@
      struct file_operations fops = {
      ... .write = write_f, ...
      +.llseek = noop_llseek, /* write uses no f_pos */
      };
      
      @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      @@
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = noop_llseek, /* read uses no f_pos */
      };
      
      @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      @@
      struct file_operations fops = {
      ...
      +.llseek = noop_llseek, /* no read or write fn */
      };
      ===== End semantic patch =====
      Signed-off-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Cc: Julia Lawall <julia@diku.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      6038f373
  21. 30 Mar, 2010 1 commit
    • Tejun Heo's avatar
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: 's avatarTejun Heo <tj@kernel.org>
      Guess-its-ok-by: 's avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  22. 25 Jan, 2010 1 commit
  23. 22 Dec, 2009 1 commit
  24. 23 Sep, 2009 1 commit
    • Davide Libenzi's avatar
      anonfd: split interface into file creation and install · 562787a5
      Davide Libenzi authored
      Split the anonfd interface into a bare file pointer creation one, and a
      file pointer creation plus install one.
      
      There are cases, like the usage of eventfds inside other kernel
      interfaces, where the file pointer created by anonfd needs to be used
      inside the initialization of other structures.
      
      As it is right now, as soon as anon_inode_getfd() returns, the kenrle can
      race with userspace closing the newly installed file descriptor.
      
      This patch, while keeping the old anon_inode_getfd(), introduces a new
      anon_inode_getfile() (whose services are reused in anon_inode_getfd())
      that allows to split the file creation phase and the fd install one.
      
      Once all the kernel structures are initialized, the code can call the
      proper fd_install().
      
      Gregory manifested the need for something like this inside KVM.
      Signed-off-by: 's avatarDavide Libenzi <davidel@xmailserver.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: James Morris <jmorris@namei.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      Acked-by: 's avatarSerge Hallyn <serue@us.ibm.com>
      Acked-by: 's avatarRoland Dreier <rolandd@cisco.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      562787a5
  25. 01 Jul, 2009 1 commit
  26. 12 Jun, 2009 1 commit
  27. 01 Apr, 2009 2 commits
  28. 14 Jan, 2009 1 commit
  29. 24 Jul, 2008 4 commits
    • Ulrich Drepper's avatar
      flag parameters: check magic constants · e38b36f3
      Ulrich Drepper authored
      This patch adds test that ensure the boundary conditions for the various
      constants introduced in the previous patches is met.  No code is generated.
      
      [akpm@linux-foundation.org: fix alpha]
      Signed-off-by: 's avatarUlrich Drepper <drepper@redhat.com>
      Acked-by: 's avatarDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      e38b36f3
    • Ulrich Drepper's avatar
      flag parameters: NONBLOCK in eventfd · e7d476df
      Ulrich Drepper authored
      This patch adds support for the EFD_NONBLOCK flag to eventfd2.  The
      additional changes needed are minimal.
      
      The following test must be adjusted for architectures other than x86 and
      x86-64 and in case the syscall numbers changed.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      
      #ifndef __NR_eventfd2
      # ifdef __x86_64__
      #  define __NR_eventfd2 290
      # elif defined __i386__
      #  define __NR_eventfd2 328
      # else
      #  error "need __NR_eventfd2"
      # endif
      #endif
      
      #define EFD_NONBLOCK O_NONBLOCK
      
      int
      main (void)
      {
        int fd = syscall (__NR_eventfd2, 1, 0);
        if (fd == -1)
          {
            puts ("eventfd2(0) failed");
            return 1;
          }
        int fl = fcntl (fd, F_GETFL);
        if (fl == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if (fl & O_NONBLOCK)
          {
            puts ("eventfd2(0) sets non-blocking mode");
            return 1;
          }
        close (fd);
      
        fd = syscall (__NR_eventfd2, 1, EFD_NONBLOCK);
        if (fd == -1)
          {
            puts ("eventfd2(EFD_NONBLOCK) failed");
            return 1;
          }
        fl = fcntl (fd, F_GETFL);
        if (fl == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if ((fl & O_NONBLOCK) == 0)
          {
            puts ("eventfd2(EFD_NONBLOCK) does not set non-blocking mode");
            return 1;
          }
        close (fd);
      
        puts ("OK");
      
        return 0;
      }
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Signed-off-by: 's avatarUlrich Drepper <drepper@redhat.com>
      Acked-by: 's avatarDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7d476df
    • Ulrich Drepper's avatar
      flag parameters: eventfd · b087498e
      Ulrich Drepper authored
      This patch adds the new eventfd2 syscall.  It extends the old eventfd
      syscall by one parameter which is meant to hold a flag value.  In this
      patch the only flag support is EFD_CLOEXEC which causes the close-on-exec
      flag for the returned file descriptor to be set.
      
      A new name EFD_CLOEXEC is introduced which in this implementation must
      have the same value as O_CLOEXEC.
      
      The following test must be adjusted for architectures other than x86 and
      x86-64 and in case the syscall numbers changed.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      
      #ifndef __NR_eventfd2
      # ifdef __x86_64__
      #  define __NR_eventfd2 290
      # elif defined __i386__
      #  define __NR_eventfd2 328
      # else
      #  error "need __NR_eventfd2"
      # endif
      #endif
      
      #define EFD_CLOEXEC O_CLOEXEC
      
      int
      main (void)
      {
        int fd = syscall (__NR_eventfd2, 1, 0);
        if (fd == -1)
          {
            puts ("eventfd2(0) failed");
            return 1;
          }
        int coe = fcntl (fd, F_GETFD);
        if (coe == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if (coe & FD_CLOEXEC)
          {
            puts ("eventfd2(0) sets close-on-exec flag");
            return 1;
          }
        close (fd);
      
        fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC);
        if (fd == -1)
          {
            puts ("eventfd2(EFD_CLOEXEC) failed");
            return 1;
          }
        coe = fcntl (fd, F_GETFD);
        if (coe == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if ((coe & FD_CLOEXEC) == 0)
          {
            puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag");
            return 1;
          }
        close (fd);
      
        puts ("OK");
      
        return 0;
      }
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      [akpm@linux-foundation.org: add sys_ni stub]
      Signed-off-by: 's avatarUlrich Drepper <drepper@redhat.com>
      Acked-by: 's avatarDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      b087498e
    • Ulrich Drepper's avatar
      flag parameters: anon_inode_getfd extension · 7d9dbca3
      Ulrich Drepper authored
      This patch just extends the anon_inode_getfd interface to take an additional
      parameter with a flag value.  The flag value is passed on to
      get_unused_fd_flags in anticipation for a use with the O_CLOEXEC flag.
      
      No actual semantic changes here, the changed callers all pass 0 for now.
      
      [akpm@linux-foundation.org: KVM fix]
      Signed-off-by: 's avatarUlrich Drepper <drepper@redhat.com>
      Acked-by: 's avatarDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d9dbca3
  30. 01 May, 2008 1 commit
    • Al Viro's avatar
      [PATCH] sanitize anon_inode_getfd() · 2030a42c
      Al Viro authored
      a) none of the callers even looks at inode or file returned by anon_inode_getfd()
      b) any caller that would try to look at those would be racy, since by the time
      it returns we might have raced with close() from another thread and that
      file would be pining for fjords.
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      2030a42c
  31. 06 Feb, 2008 1 commit
  32. 18 May, 2007 1 commit
  33. 11 May, 2007 1 commit
    • Davide Libenzi's avatar
      signal/timer/event: eventfd core · e1ad7468
      Davide Libenzi authored
      This is a very simple and light file descriptor, that can be used as event
      wait/dispatch by userspace (both wait and dispatch) and by the kernel
      (dispatch only).  It can be used instead of pipe(2) in all cases where those
      would simply be used to signal events.  Their kernel overhead is much lower
      than pipes, and they do not consume two fds.  When used in the kernel, it can
      offer an fd-bridge to enable, for example, functionalities like KAIO or
      syslets/threadlets to signal to an fd the completion of certain operations.
      But more in general, an eventfd can be used by the kernel to signal readiness,
      in a POSIX poll/select way, of interfaces that would otherwise be incompatible
      with it.  The API is:
      
      int eventfd(unsigned int count);
      
      The eventfd API accepts an initial "count" parameter, and returns an eventfd
      fd.  It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
      
      The POLLIN flag is raised when the internal counter is greater than zero.
      
      The POLLOUT flag is raised when at least a value of "1" can be written to the
      internal counter.
      
      The POLLERR flag is raised when an overflow in the counter value is detected.
      
      The write(2) operation can never overflow the counter, since it blocks (unless
      O_NONBLOCK is set, in which case -EAGAIN is returned).
      
      But the eventfd_signal() function can do it, since it's supposed to not sleep
      during its operation.
      
      The read(2) function reads the __u64 counter value, and reset the internal
      value to zero.  If the value read is equal to (__u64) -1, an overflow happened
      on the internal counter (due to 2^64 eventfd_signal() posts that has never
      been retired - unlickely, but possible).
      
      The write(2) call writes an __u64 count value, and adds it to the current
      counter.  The eventfd fd supports O_NONBLOCK also.
      
      On the kernel side, we have:
      
      struct file *eventfd_fget(int fd);
      int eventfd_signal(struct file *file, unsigned int n);
      
      The eventfd_fget() should be called to get a struct file* from an eventfd fd
      (this is an fget() + check of f_op being an eventfd fops pointer).
      
      The kernel can then call eventfd_signal() every time it wants to post an event
      to userspace.  The eventfd_signal() function can be called from any context.
      An eventfd() simple test and bench is available here:
      
      http://www.xmailserver.org/eventfd-bench.c
      
      This is the eventfd-based version of pipetest-4 (pipe(2) based):
      
      http://www.xmailserver.org/pipetest-4.c
      
      Not that performance matters much in the eventfd case, but eventfd-bench
      shows almost as double as performance than pipetest-4.
      
      [akpm@linux-foundation.org: fix i386 build]
      [akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
      Signed-off-by: 's avatarDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1ad7468