1. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: 's avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: 's avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: 's avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 30 Jun, 2017 1 commit
    • Kees Cook's avatar
      randstruct: Mark various structs for randomization · 3859a271
      Kees Cook authored
      This marks many critical kernel structures for randomization. These are
      structures that have been targeted in the past in security exploits, or
      contain functions pointers, pointers to function pointer tables, lists,
      workqueues, ref-counters, credentials, permissions, or are otherwise
      sensitive. This initial list was extracted from Brad Spengler/PaX Team's
      code in the last public patch of grsecurity/PaX based on my understanding
      of the code. Changes or omissions from the original code are mine and
      don't reflect the original grsecurity/PaX code.
      Left out of this list is task_struct, which requires special handling
      and will be covered in a subsequent patch.
      Signed-off-by: 's avatarKees Cook <keescook@chromium.org>
  3. 23 May, 2017 2 commits
    • Eric W. Biederman's avatar
      mnt: In propgate_umount handle visiting mounts in any order · 99b19d16
      Eric W. Biederman authored
      While investigating some poor umount performance I realized that in
      the case of overlapping mount trees where some of the mounts are locked
      the code has been failing to unmount all of the mounts it should
      have been unmounting.
      This failure to unmount all of the necessary
      mounts can be reproduced with:
      $ cat locked_mounts_test.sh
      mount -t tmpfs test-base /mnt
      mount --make-shared /mnt
      mkdir -p /mnt/b
      mount -t tmpfs test1 /mnt/b
      mount --make-shared /mnt/b
      mkdir -p /mnt/b/10
      mount -t tmpfs test2 /mnt/b/10
      mount --make-shared /mnt/b/10
      mkdir -p /mnt/b/10/20
      mount --rbind /mnt/b /mnt/b/10/20
      unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
      sleep 1
      umount -l /mnt/b
      wait %%
      $ unshare -Urm ./locked_mounts_test.sh
      This failure is corrected by removing the prepass that marks mounts
      that may be umounted.
      A first pass is added that umounts mounts if possible and if not sets
      mount mark if they could be unmounted if they weren't locked and adds
      them to a list to umount possibilities.  This first pass reconsiders
      the mounts parent if it is on the list of umount possibilities, ensuring
      that information of umoutability will pass from child to mount parent.
      A second pass then walks through all mounts that are umounted and processes
      their children unmounting them or marking them for reparenting.
      A last pass cleans up the state on the mounts that could not be umounted
      and if applicable reparents them to their first parent that remained
      While a bit longer than the old code this code is much more robust
      as it allows information to flow up from the leaves and down
      from the trunk making the order in which mounts are encountered
      in the umount propgation tree irrelevant.
      Cc: stable@vger.kernel.org
      Fixes: 0c56fe31 ("mnt: Don't propagate unmounts to locked mounts")
      Reviewed-by: 's avatarAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
    • Eric W. Biederman's avatar
      mnt: In umount propagation reparent in a separate pass · 570487d3
      Eric W. Biederman authored
      It was observed that in some pathlogical cases that the current code
      does not unmount everything it should.  After investigation it
      was determined that the issue is that mnt_change_mntpoint can
      can change which mounts are available to be unmounted during mount
      propagation which is wrong.
      The trivial reproducer is:
      $ cat ./pathological.sh
      mount -t tmpfs test-base /mnt
      cd /mnt
      mkdir 1 2 1/1
      mount --bind 1 1
      mount --make-shared 1
      mount --bind 1 2
      mount --bind 1/1 1/1
      mount --bind 1/1 1/1
      grep test-base /proc/self/mountinfo
      umount 1/1
      grep test-base /proc/self/mountinfo
      $ unshare -Urm ./pathological.sh
      The expected output looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      The output without the fix looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      That last mount in the output was in the propgation tree to be unmounted but
      was missed because the mnt_change_mountpoint changed it's parent before the walk
      through the mount propagation tree observed it.
      Cc: stable@vger.kernel.org
      Fixes: 1064f874 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
      Acked-by: 's avatarAndrei Vagin <avagin@virtuozzo.com>
      Reviewed-by: 's avatarRam Pai <linuxram@us.ibm.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
  4. 10 Apr, 2017 2 commits
    • Jan Kara's avatar
      fsnotify: Free fsnotify_mark_connector when there is no mark attached · 08991e83
      Jan Kara authored
      Currently we free fsnotify_mark_connector structure only when inode /
      vfsmount is getting freed. This can however impose noticeable memory
      overhead when marks get attached to inodes only temporarily. So free the
      connector structure once the last mark is detached from the object.
      Since notification infrastructure can be working with the connector
      under the protection of fsnotify_mark_srcu, we have to be careful and
      free the fsnotify_mark_connector only after SRCU period passes.
      Reviewed-by: 's avatarMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: 's avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
    • Jan Kara's avatar
      fsnotify: Move mark list head from object into dedicated structure · 9dd813c1
      Jan Kara authored
      Currently notification marks are attached to object (inode or vfsmnt) by
      a hlist_head in the object. The list is also protected by a spinlock in
      the object. So while there is any mark attached to the list of marks,
      the object must be pinned in memory (and thus e.g. last iput() deleting
      inode cannot happen). Also for list iteration in fsnotify() to work, we
      must hold fsnotify_mark_srcu lock so that mark itself and
      mark->obj_list.next cannot get freed. Thus we are required to wait for
      response to fanotify events from userspace process with
      fsnotify_mark_srcu lock held. That causes issues when userspace process
      is buggy and does not reply to some event - basically the whole
      notification subsystem gets eventually stuck.
      So to be able to drop fsnotify_mark_srcu lock while waiting for
      response, we have to pin the mark in memory and make sure it stays in
      the object list (as removing the mark waiting for response could lead to
      lost notification events for groups later in the list). However we don't
      want inode reclaim to block on such mark as that would lead to system
      just locking up elsewhere.
      This commit is the first in the series that paves way towards solving
      these conflicting lifetime needs. Instead of anchoring the list of marks
      directly in the object, we anchor it in a dedicated structure
      (fsnotify_mark_connector) and just point to that structure from the
      object. The following commits will also add spinlock protecting the list
      and object pointer to the structure.
      Reviewed-by: 's avatarMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: 's avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: 's avatarJan Kara <jack@suse.cz>
  5. 03 Feb, 2017 1 commit
    • Eric W. Biederman's avatar
      mnt: Tuck mounts under others instead of creating shadow/side mounts. · 1064f874
      Eric W. Biederman authored
      Ever since mount propagation was introduced in cases where a mount in
      propagated to parent mount mountpoint pair that is already in use the
      code has placed the new mount behind the old mount in the mount hash
      This implementation detail is problematic as it allows creating
      arbitrary length mount hash chains.
      Furthermore it invalidates the constraint maintained elsewhere in the
      mount code that a parent mount and a mountpoint pair will have exactly
      one mount upon them.  Making it hard to deal with and to talk about
      this special case in the mount code.
      Modify mount propagation to notice when there is already a mount at
      the parent mount and mountpoint where a new mount is propagating to
      and place that preexisting mount on top of the new mount.
      Modify unmount propagation to notice when a mount that is being
      unmounted has another mount on top of it (and no other children), and
      to replace the unmounted mount with the mount on top of it.
      Move the MNT_UMUONT test from __lookup_mnt_last into
      __propagate_umount as that is the only call of __lookup_mnt_last where
      MNT_UMOUNT may be set on any mount visible in the mount hash table.
      These modifications allow:
       - __lookup_mnt_last to be removed.
       - attach_shadows to be renamed __attach_mnt and its shadow
         handling to be removed.
       - commit_tree to be simplified
       - copy_tree to be simplified
      The result is an easier to understand tree of mounts that does not
      allow creation of arbitrary length hash chains in the mount hash table.
      The result is also a very slight userspace visible difference in semantics.
      The following two cases now behave identically, where before order
      case 1: (explicit user action)
      	B is a slave of A
      	mount something on A/a , it will propagate to B/a
      	and than mount something on B/a
      case 2: (tucked mount)
      	B is a slave of A
      	mount something on B/a
      	and than mount something on A/a
      Histroically umount A/a would fail in case 1 and succeed in case 2.
      Now umount A/a succeeds in both configurations.
      This very small change in semantics appears if anything to be a bug
      fix to me and my survey of userspace leads me to believe that no programs
      will notice or care of this subtle semantic change.
      v2: Updated to mnt_change_mountpoint to not call dput or mntput
      and instead to decrement the counts directly.  It is guaranteed
      that there will be other references when mnt_change_mountpoint is
      called so this is safe.
      v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
          As the locking in fs/namespace.c changed between v2 and v3.
      v4: Reworked the logic in propagate_mount_busy and __propagate_umount
          that detects when a mount completely covers another mount.
      v5: Removed unnecessary tests whose result is alwasy true in
          find_topper and attach_recursive_mnt.
      v6: Document the user space visible semantic difference.
      Cc: stable@vger.kernel.org
      Fixes: b90fa9ae ("[PATCH] shared mount handling: bind and rbind")
      Tested-by: 's avatarAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
  6. 04 Dec, 2016 1 commit
  7. 30 Sep, 2016 1 commit
    • Eric W. Biederman's avatar
      mnt: Add a per mount namespace limit on the number of mounts · d2921684
      Eric W. Biederman authored
      CAI Qian <caiqian@redhat.com> pointed out that the semantics
      of shared subtrees make it possible to create an exponentially
      increasing number of mounts in a mount namespace.
          mkdir /tmp/1 /tmp/2
          mount --make-rshared /
          for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done
      Will create create 2^20 or 1048576 mounts, which is a practical problem
      as some people have managed to hit this by accident.
      As such CVE-2016-6213 was assigned.
      Ian Kent <raven@themaw.net> described the situation for autofs users
      as follows:
      > The number of mounts for direct mount maps is usually not very large because of
      > the way they are implemented, large direct mount maps can have performance
      > problems. There can be anywhere from a few (likely case a few hundred) to less
      > than 10000, plus mounts that have been triggered and not yet expired.
      > Indirect mounts have one autofs mount at the root plus the number of mounts that
      > have been triggered and not yet expired.
      > The number of autofs indirect map entries can range from a few to the common
      > case of several thousand and in rare cases up to between 30000 and 50000. I've
      > not heard of people with maps larger than 50000 entries.
      > The larger the number of map entries the greater the possibility for a large
      > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
      > more active mounts.
      So I am setting the default number of mounts allowed per mount
      namespace at 100,000.  This is more than enough for any use case I
      know of, but small enough to quickly stop an exponential increase
      in mounts.  Which should be perfect to catch misconfigurations and
      malfunctioning programs.
      For anyone who needs a higher limit this can be changed by writing
      to the new /proc/sys/fs/mount-max sysctl.
      Tested-by: 's avatarCAI Qian <caiqian@redhat.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
  8. 31 Aug, 2016 1 commit
  9. 01 Jul, 2015 1 commit
    • Yann Droneaud's avatar
      fs: use seq_open_private() for proc_mounts · ede1bf0d
      Yann Droneaud authored
      A patchset to remove support for passing pre-allocated struct seq_file to
      seq_open().  Such feature is undocumented and prone to error.
      In particular, if seq_release() is used in release handler, it will
      kfree() a pointer which was not allocated by seq_open().
      So this patchset drops support for pre-allocated struct seq_file: it's
      only of use in proc_namespace.c and can be easily replaced by using
      Additionally, it documents the use of file->private_data to hold pointer
      to struct seq_file by seq_open().
      This patch (of 3):
      Since patch described below, from v2.6.15-rc1, seq_open() could use a
      struct seq_file already allocated by the caller if the pointer to the
      structure is stored in file->private_data before calling the function.
          Commit 1abe77b0
          Author: Al Viro <viro@zeniv.linux.org.uk>
          Date:   Mon Nov 7 17:15:34 2005 -0500
              [PATCH] allow callers of seq_open do allocation themselves
              Allow caller of seq_open() to kmalloc() seq_file + whatever else they
              want and set ->private_data to it.  seq_open() will then abstain from
              doing allocation itself.
      Such behavior is only used by mounts_open_common().
      In order to drop support for such uncommon feature, proc_mounts is
      converted to use seq_open_private(), which take care of allocating the
      proc_mounts structure, making it available through ->private in struct
      Conversely, proc_mounts is converted to use seq_release_private(), in
      order to release the private structure allocated by seq_open_private().
      Then, ->private is used directly instead of proc_mounts() macro to access
      to the proc_mounts structure.
      Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.comSigned-off-by: 's avatarYann Droneaud <ydroneaud@opteya.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  10. 11 May, 2015 1 commit
    • Al Viro's avatar
      new helper: __legitimize_mnt() · 294d71ff
      Al Viro authored
      same as legitimize_mnt(), except that it does *not* drop and regain
      rcu_read_lock; return values are
      0  =>  grabbed a reference, we are fine
      1  =>  failed, just go away
      -1 =>  failed, go away and mntput(bastard) when outside of rcu_read_lock
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
  11. 26 Jan, 2015 1 commit
  12. 04 Dec, 2014 1 commit
  13. 09 Oct, 2014 4 commits
    • Eric W. Biederman's avatar
      vfs: Add a function to lazily unmount all mounts from any dentry. · 80b5dce8
      Eric W. Biederman authored
      The new function detach_mounts comes in two pieces.  The first piece
      is a static inline test of d_mounpoint that returns immediately
      without taking any locks if d_mounpoint is not set.  In the common
      case when mountpoints are absent this allows the vfs to continue
      running with it's same cacheline foot print.
      The second piece of detach_mounts __detach_mounts actually does the
      work and it assumes that a mountpoint is present so it is slow and
      takes namespace_sem for write, and then locks the mount hash (aka
      mount_lock) after a struct mountpoint has been found.
      With those two locks held each entry on the list of mounts on a
      mountpoint is selected and lazily unmounted until all of the mount
      have been lazily unmounted.
      v7: Wrote a proper change description and removed the changelog
          documenting deleted wrong turns.
      Signed-off-by: 's avatarEric W. Biederman <ebiederman@twitter.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
    • Eric W. Biederman's avatar
      vfs: Keep a list of mounts on a mount point · 0a5eb7c8
      Eric W. Biederman authored
      To spot any possible problems call BUG if a mountpoint
      is put when it's list of mounts is not empty.
      AV: use hlist instead of list_head
      Reviewed-by: 's avatarMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: 's avatarEric W. Biederman <ebiederman@twitter.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
    • Eric W. Biederman's avatar
      vfs: Don't allow overwriting mounts in the current mount namespace · 7af1364f
      Eric W. Biederman authored
      In preparation for allowing mountpoints to be renamed and unlinked
      in remote filesystems and in other mount namespaces test if on a dentry
      there is a mount in the local mount namespace before allowing it to
      be renamed or unlinked.
      The primary motivation here are old versions of fusermount unmount
      which is not safe if the a path can be renamed or unlinked while it is
      verifying the mount is safe to unmount.  More recent versions are simpler
      and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
      in a directory owned by an arbitrary user.
      Miklos Szeredi <miklos@szeredi.hu> reports this is approach is good
      enough to remove concerns about new kernels mixed with old versions
      of fusermount.
      A secondary motivation for restrictions here is that it removing empty
      directories that have non-empty mount points on them appears to
      violate the rule that rmdir can not remove empty directories.  As
      Linus Torvalds pointed out this is useful for programs (like git) that
      test if a directory is empty with rmdir.
      Therefore this patch arranges to enforce the existing mount point
      semantics for local mount namespace.
      v2: Rewrote the test to be a drop in replacement for d_mountpoint
      v3: Use bool instead of int as the return type of is_local_mountpoint
      Reviewed-by: 's avatarMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
    • Al Viro's avatar
      delayed mntput · 9ea459e1
      Al Viro authored
      On final mntput() we want fs shutdown to happen before return to
      userland; however, the only case where we want it happen right
      there (i.e. where task_work_add won't do) is MNT_INTERNAL victim.
      Those have to be fully synchronous - failure halfway through module
      init might count on having vfsmount killed right there.  Fortunately,
      final mntput on MNT_INTERNAL vfsmounts happens on shallow stack.
      So we handle those synchronously and do an analog of delayed fput
      logics for everything else.
      As the result, we are guaranteed that fs shutdown will always happen
      on shallow stack.
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
  14. 07 Aug, 2014 2 commits
    • Al Viro's avatar
      death to mnt_pinned · 3064c356
      Al Viro authored
      Rather than playing silly buggers with vfsmount refcounts, just have
      acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
      and replace it with said clone.  Then attach the pin to original
      vfsmount.  Voila - the clone will be alive until the file gets closed,
      making sure that underlying superblock remains active, etc., and
      we can drop the original vfsmount, so that it's not kept busy.
      If the file lives until the final mntput of the original vfsmount,
      we'll notice that there's an fs_pin (one in bsd_acct_struct that
      holds that file) and mnt_pin_kill() will take it out.  Since
      ->kill() is synchronous, we won't proceed past that point until
      these files are closed (and private clones of our vfsmount are
      gone), so we get the same ordering warranties we used to get.
      mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
      it never became usable outside of kernel/acct.c (and racy wrt
      umount even there).
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
    • Al Viro's avatar
      acct: get rid of acct_list · 215752fc
      Al Viro authored
      Put these suckers on per-vfsmount and per-superblock lists instead.
      Note: right now it's still acct_lock for everything, but that's
      going to change.
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
  15. 02 Apr, 2014 1 commit
  16. 30 Mar, 2014 2 commits
    • Al Viro's avatar
      switch mnt_hash to hlist · 38129a13
      Al Viro authored
      fixes RCU bug - walking through hlist is safe in face of element moves,
      since it's self-terminating.  Cyclic lists are not - if we end up jumping
      to another hash chain, we'll loop infinitely without ever hitting the
      original list head.
      [fix for dumb braino folded]
      Spotted by: Max Kellermann <mk@cm4all.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
    • Al Viro's avatar
      resizable namespace.c hashes · 0818bf27
      Al Viro authored
      * switch allocation to alloc_large_system_hash()
      * make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
      * switch mountpoint_hashtable from list_head to hlist_head
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
  17. 26 Jan, 2014 1 commit
  18. 09 Nov, 2013 1 commit
    • Al Viro's avatar
      RCU'd vfsmounts · 48a066e7
      Al Viro authored
      * RCU-delayed freeing of vfsmounts
      * vfsmount_lock replaced with a seqlock (mount_lock)
      * sequence number from mount_lock is stored in nameidata->m_seq and
      used when we exit RCU mode
      * new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
      caller knows that vfsmount will have no surviving references.
      * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
      and doing pending mntput().
      * new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
      number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
      again to close the race and either returns success or drops the reference it
      has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
      simply decrement the refcount and sod off - aforementioned synchronize_rcu()
      makes sure that final mntput() won't come until we leave RCU mode.  We need
      that, since we don't want to end up with some lazy pathwalk racing with
      umount() and stealing the final mntput() from it - caller of umount() may
      expect it to return only once the fs is shut down and we don't want to break
      that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
      full-blown mntput() in case of mount_lock sequence number mismatch happening
      just as we'd grabbed the reference, but in those cases we won't be stealing
      the final mntput() from anything that would care.
      * mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
      SMP and UP cases are handled the same way - no ifdefs there.
      * normal pathname resolution does *not* do any writes to mount_lock.  It does,
      of course, bump the refcounts of vfsmount and dentry in the very end, but that's
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
  19. 25 Oct, 2013 3 commits
  20. 09 Apr, 2013 1 commit
  21. 20 Nov, 2012 1 commit
    • Eric W. Biederman's avatar
      proc: Usable inode numbers for the namespace file descriptors. · 98f842e6
      Eric W. Biederman authored
      Assign a unique proc inode to each namespace, and use that
      inode number to ensure we only allocate at most one proc
      inode for every namespace in proc.
      A single proc inode per namespace allows userspace to test
      to see if two processes are in the same namespace.
      This has been a long requested feature and only blocked because
      a naive implementation would put the id in a global space and
      would ultimately require having a namespace for the names of
      namespaces, making migration and certain virtualization tricks
      We still don't have per superblock inode numbers for proc, which
      appears necessary for application unaware checkpoint/restart and
      migrations (if the application is using namespace file descriptors)
      but that is now allowd by the design if it becomes important.
      I have preallocated the ipc and uts initial proc inode numbers so
      their structures can be statically initialized.
      Signed-off-by: 's avatarEric W. Biederman <ebiederm@xmission.com>
  22. 19 Nov, 2012 2 commits
    • Eric W. Biederman's avatar
      vfs: Add a user namespace reference from struct mnt_namespace · 771b1371
      Eric W. Biederman authored
      This will allow for support for unprivileged mounts in a new user namespace.
      Acked-by: 's avatar"Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: 's avatar"Eric W. Biederman" <ebiederm@xmission.com>
    • Eric W. Biederman's avatar
      vfs: Add setns support for the mount namespace · 8823c079
      Eric W. Biederman authored
      setns support for the mount namespace is a little tricky as an
      arbitrary decision must be made about what to set fs->root and
      fs->pwd to, as there is no expectation of a relationship between
      the two mount namespaces.  Therefore I arbitrarily find the root
      mount point, and follow every mount on top of it to find the top
      of the mount stack.  Then I set fs->root and fs->pwd to that
      location.  The topmost root of the mount stack seems like a
      reasonable place to be.
      Bind mount support for the mount namespace inodes has the
      possibility of creating circular dependencies between mount
      namespaces.  Circular dependencies can result in loops that
      prevent mount namespaces from every being freed.  I avoid
      creating those circular dependencies by adding a sequence number
      to the mount namespace and require all bind mounts be of a
      younger mount namespace into an older mount namespace.
      Add a helper function proc_ns_inode so it is possible to
      detect when we are attempting to bind mound a namespace inode.
      Acked-by: 's avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: 's avatarEric W. Biederman <ebiederm@xmission.com>
  23. 14 Jul, 2012 2 commits
  24. 07 Jan, 2012 1 commit
  25. 04 Jan, 2012 5 commits