1. 20 Apr, 2018 2 commits
    • Arnd Bergmann's avatar
      y2038: ipc: Enable COMPAT_32BIT_TIME · b0d17578
      Arnd Bergmann authored
      Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop)
      take a timespec argument. After we move 32-bit architectures over to
      useing 64-bit time_t based syscalls, we need seperate entry points for
      the old 32-bit based interfaces.
      
      This changes the #ifdef guards for the existing 32-bit compat syscalls
      to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
      enabled on all existing 32-bit architectures.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      b0d17578
    • Arnd Bergmann's avatar
      y2038: ipc: Use __kernel_timespec · 21fc538d
      Arnd Bergmann authored
      This is a preparatation for changing over __kernel_timespec to 64-bit
      times, which involves assigning new system call numbers for mq_timedsend(),
      mq_timedreceive() and semtimedop() for compatibility with future y2038
      proof user space.
      
      The existing ABIs will remain available through compat code.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      21fc538d
  2. 25 Mar, 2018 1 commit
    • Eric W. Biederman's avatar
      Revert "mqueue: switch to on-demand creation of internal mount" · cfb2f6f6
      Eric W. Biederman authored
      This reverts commit 36735a6a.
      
      Aleksa Sarai <asarai@suse.de> writes:
      > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
      >
      > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
      > which was introduced by 36735a6a ("mqueue: switch to on-demand
      > creation of internal mount").
      >
      > Basically, the reproducer boils down to being able to mount mqueue if
      > you create a new user namespace, even if you don't unshare the IPC
      > namespace.
      >
      > Previously this was not possible, and you would get an -EPERM. The mount
      > is the *host* mqueue mount, which is being cached and just returned from
      > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
      > it was intentional -- since I'm not familiar with mqueue).
      >
      > To me it looks like there is a missing permission check. I've included a
      > patch below that I've compile-tested, and should block the above case.
      > Can someone please tell me if I'm missing something? Is this actually
      > safe?
      >
      > [1]: https://github.com/docker/docker/issues/36674
      
      The issue is a lot deeper than a missing permission check.  sb->s_user_ns
      was is improperly set as well.  So in addition to the filesystem being
      mounted when it should not be mounted, so things are not allow that should
      be.
      
      We are practically to the release of 4.16 and there is no agreement between
      Al Viro and myself on what the code should looks like to fix things properly.
      So revert the code to what it was before so that we can take our time
      and discuss this properly.
      
      Fixes: 36735a6a ("mqueue: switch to on-demand creation of internal mount")
      Reported-by: default avatarFelix Abecassis <fabecassis@nvidia.com>
      Reported-by: default avatarAleksa Sarai <asarai@suse.de>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      cfb2f6f6
  3. 11 Feb, 2018 1 commit
    • Linus Torvalds's avatar
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  4. 07 Feb, 2018 1 commit
  5. 12 Jan, 2018 1 commit
    • Eric W. Biederman's avatar
      signal: Ensure generic siginfos the kernel sends have all bits initialized · faf1f22b
      Eric W. Biederman authored
      Call clear_siginfo to ensure stack allocated siginfos are fully
      initialized before being passed to the signal sending functions.
      
      This ensures that if there is the kind of confusion documented by
      TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
      data to userspace when the kernel generates a signal with SI_USER but
      the copy to userspace assumes it is a different kind of signal, and
      different fields are initialized.
      
      This also prepares the way for turning copy_siginfo_to_user
      into a copy_to_user, by removing the need in many cases to perform
      a field by field copy simply to skip the uninitialized fields.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      faf1f22b
  6. 05 Jan, 2018 7 commits
  7. 27 Nov, 2017 2 commits
    • Al Viro's avatar
      ipc, kernel, mm: annotate ->poll() instances · 9dd95748
      Al Viro authored
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      9dd95748
    • Linus Torvalds's avatar
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds authored
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1751e8a6
  8. 04 Sep, 2017 1 commit
  9. 09 Jul, 2017 1 commit
    • Cong Wang's avatar
      mqueue: fix a use-after-free in sys_mq_notify() · f991af3d
      Cong Wang authored
      The retry logic for netlink_attachskb() inside sys_mq_notify()
      is nasty and vulnerable:
      
      1) The sock refcnt is already released when retry is needed
      2) The fd is controllable by user-space because we already
         release the file refcnt
      
      so we when retry but the fd has been just closed by user-space
      during this small window, we end up calling netlink_detachskb()
      on the error path which releases the sock again, later when
      the user-space closes this socket a use-after-free could be
      triggered.
      
      Setting 'sock' to NULL here should be sufficient to fix it.
      Reported-by: default avatarGeneBlue <geneblue.mail@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f991af3d
  10. 04 Jul, 2017 1 commit
  11. 02 Mar, 2017 3 commits
  12. 28 Feb, 2017 1 commit
  13. 21 Nov, 2016 1 commit
  14. 28 Sep, 2016 1 commit
  15. 23 Jun, 2016 3 commits
    • Eric W. Biederman's avatar
      vfs: Generalize filesystem nodev handling. · a2982cc9
      Eric W. Biederman authored
      Introduce a function may_open_dev that tests MNT_NODEV and a new
      superblock flab SB_I_NODEV.  Use this new function in all of the
      places where MNT_NODEV was previously tested.
      
      Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
      filesystems should never support device nodes, and a simple superblock
      flags makes that very hard to get wrong.  With SB_I_NODEV set if any
      device nodes somehow manage to show up on on a filesystem those
      device nodes will be unopenable.
      Acked-by: default avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      a2982cc9
    • Eric W. Biederman's avatar
      ipc/mqueue: The mqueue filesystem should never contain executables · 3ee69014
      Eric W. Biederman authored
      Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
      do not result in executable on mqueuefs by accident.
      Acked-by: default avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      3ee69014
    • Eric W. Biederman's avatar
      vfs: Pass data, ns, and ns->userns to mount_ns · d91ee87d
      Eric W. Biederman authored
      Today what is normally called data (the mount options) is not passed
      to fill_super through mount_ns.
      
      Pass the mount options and the namespace separately to mount_ns so
      that filesystems such as proc that have mount options, can use
      mount_ns.
      
      Pass the user namespace to mount_ns so that the standard permission
      check that verifies the mounter has permissions over the namespace can
      be performed in mount_ns instead of in each filesystems .mount method.
      Thus removing the duplication between mqueuefs and proc in terms of
      permission checks.  The extra permission check does not currently
      affect the rpc_pipefs filesystem and the nfsd filesystem as those
      filesystems do not currently allow unprivileged mounts.  Without
      unpvileged mounts it is guaranteed that the caller has already passed
      capable(CAP_SYS_ADMIN) which guarantees extra permission check will
      pass.
      
      Update rpc_pipefs and the nfsd filesystem to ensure that the network
      namespace reference is always taken in fill_super and always put in kill_sb
      so that the logic is simpler and so that errors originating inside of
      fill_super do not cause a network namespace leak.
      Acked-by: default avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      d91ee87d
  16. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  17. 22 Jan, 2016 1 commit
    • Al Viro's avatar
      wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  18. 15 Jan, 2016 1 commit
    • Vladimir Davydov's avatar
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov authored
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  19. 07 Aug, 2015 1 commit
    • Marcus Gelderie's avatar
      ipc: modify message queue accounting to not take kernel data structures into account · de54b9ac
      Marcus Gelderie authored
      A while back, the message queue implementation in the kernel was
      improved to use btrees to speed up retrieval of messages, in commit
      d6629859 ("ipc/mqueue: improve performance of send/recv").
      
      That patch introducing the improved kernel handling of message queues
      (using btrees) has, as a by-product, changed the meaning of the QSIZE
      field in the pseudo-file created for the queue.  Before, this field
      reflected the size of the user-data in the queue.  Since, it also takes
      kernel data structures into account.  For example, if 13 bytes of user
      data are in the queue, on my machine the file reports a size of 61
      bytes.
      
      There was some discussion on this topic before (for example
      https://lkml.org/lkml/2014/10/1/115).  Commenting on a th lkml, Michael
      Kerrisk gave the following background
      (https://lkml.org/lkml/2015/6/16/74):
      
          The pseudofiles in the mqueue filesystem (usually mounted at
          /dev/mqueue) expose fields with metadata describing a message
          queue. One of these fields, QSIZE, as originally implemented,
          showed the total number of bytes of user data in all messages in
          the message queue, and this feature was documented from the
          beginning in the mq_overview(7) page. In 3.5, some other (useful)
          work happened to break the user-space API in a couple of places,
          including the value exposed via QSIZE, which now includes a measure
          of kernel overhead bytes for the queue, a figure that renders QSIZE
          useless for its original purpose, since there's no way to deduce
          the number of overhead bytes consumed by the implementation.
          (The other user-space breakage was subsequently fixed.)
      
      This patch removes the accounting of kernel data structures in the
      queue.  Reporting the size of these data-structures in the QSIZE field
      was a breaking change (see Michael's comment above).  Without the QSIZE
      field reporting the total size of user-data in the queue, there is no
      way to deduce this number.
      
      It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
      against the worst-case size of the queue (in both the old and the new
      implementation).  Therefore, the kernel overhead accounting in QSIZE is
      not necessary to help the user understand the limitations RLIMIT imposes
      on the processes.
      Signed-off-by: default avatarMarcus Gelderie <redmnic@gmail.com>
      Acked-by: default avatarDoug Ledford <dledford@redhat.com>
      Acked-by: default avatarMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: John Duffy <jb_duffy@btinternet.com>
      Cc: Arto Bendiken <arto@bendiken.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de54b9ac
  20. 08 May, 2015 1 commit
    • Davidlohr Bueso's avatar
      ipc/mqueue: Implement lockless pipelined wakeups · fa6004ad
      Davidlohr Bueso authored
      This patch moves the wakeup_process() invocation so it is not done under
      the info->lock by making use of a lockless wake_q. With this change, the
      waiter is woken up once it is STATE_READY and it does not need to loop
      on SMP if it is still in STATE_PENDING. In the timeout case we still need
      to grab the info->lock to verify the state.
      
      This change should also avoid the introduction of preempt_disable() in -rt
      which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY
      change if the waiter has a higher priority compared to the waker.
      
      Additionally, this patch micro-optimizes wq_sleep by using the cheaper
      cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no
      matter what, thus get rid of the implied barrier.
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarGeorge Spelvin <linux@horizon.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      fa6004ad
  21. 15 Apr, 2015 1 commit
  22. 19 Nov, 2014 1 commit
    • Al Viro's avatar
      new helper: audit_file() · 9f45f5bf
      Al Viro authored
      ... for situations when we don't have any candidate in pathnames - basically,
      in descriptor-based syscalls.
      
      [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      9f45f5bf
  23. 07 Apr, 2014 1 commit
  24. 25 Feb, 2014 1 commit
    • Davidlohr Bueso's avatar
      ipc,mqueue: remove limits for the amount of system-wide queues · f3713fd9
      Davidlohr Bueso authored
      Commit 93e6f119 ("ipc/mqueue: cleanup definition names and
      locations") added global hardcoded limits to the amount of message
      queues that can be created.  While these limits are per-namespace,
      reality is that it ends up breaking userspace applications.
      Historically users have, at least in theory, been able to create up to
      INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
      for some workloads and use cases.  For instance, Madars reports:
      
       "This update imposes bad limits on our multi-process application.  As
        our app uses approaches that each process opens its own set of queues
        (usually something about 3-5 queues per process).  In some scenarios
        we might run up to 3000 processes or more (which of-course for linux
        is not a problem).  Thus we might need up to 9000 queues or more.  All
        processes run under one user."
      
      Other affected users can be found in launchpad bug #1155695:
        https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695
      
      Instead of increasing this limit, revert it entirely and fallback to the
      original way of dealing queue limits -- where once a user's resource
      limit is reached, and all memory is used, new queues cannot be created.
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reported-by: default avatarMadars Vitolins <m@silodev.com>
      Acked-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3713fd9
  25. 28 Jan, 2014 2 commits
  26. 09 Nov, 2013 1 commit
    • J. Bruce Fields's avatar
      locks: break delegations on unlink · b21996e3
      J. Bruce Fields authored
      We need to break delegations on any operation that changes the set of
      links pointing to an inode.  Start with unlink.
      
      Such operations also hold the i_mutex on a parent directory.  Breaking a
      delegation may require waiting for a timeout (by default 90 seconds) in
      the case of a unresponsive NFS client.  To avoid blocking all directory
      operations, we therefore drop locks before waiting for the delegation.
      The logic then looks like:
      
      	acquire locks
      	...
      	test for delegation; if found:
      		take reference on inode
      		release locks
      		wait for delegation break
      		drop reference on inode
      		retry
      
      It is possible this could never terminate.  (Even if we take precautions
      to prevent another delegation being acquired on the same inode, we could
      get a different inode on each retry.)  But this seems very unlikely.
      
      The initial test for a delegation happens after the lock on the target
      inode is acquired, but the directory inode may have been acquired
      further up the call stack.  We therefore add a "struct inode **"
      argument to any intervening functions, which we use to pass the inode
      back up to the caller in the case it needs a delegation synchronously
      broken.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Acked-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      b21996e3
  27. 09 Jul, 2013 1 commit
    • Jeff Layton's avatar
      audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record · 79f6530c
      Jeff Layton authored
      The old audit PATH records for mq_open looked like this:
      
        type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777
        dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
        obj=system_u:object_r:tmpfs_t:s15:c0.c1023
        type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732
        dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
        obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
      
      ...with the audit related changes that went into 3.7, they now look like this:
      
        type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655
        dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
        obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
        type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926
        dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
        obj=system_u:object_r:tmpfs_t:s15:c0.c1023
        type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq"
      
      Both of these look wrong to me.  As Steve Grubb pointed out:
      
       "What we need is 1 PATH record that identifies the MQ.  The other PATH
        records probably should not be there."
      
      Fix it to record the mq root as a parent, and flag it such that it
      should be hidden from view when the names are logged, since the root of
      the mq filesystem isn't terribly interesting.  With this change, we get
      a single PATH record that looks more like this:
      
        type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914
        dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00
        obj=unconfined_u:object_r:user_tmpfs_t:s0
      
      In order to do this, a new audit_inode_parent_hidden() function is
      added.  If we do it this way, then we avoid having the existing callers
      of audit_inode needing to do any sort of flag conversion if auditing is
      inactive.
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Reported-by: default avatarJiri Jaburek <jjaburek@redhat.com>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79f6530c