1. 01 Mar, 2019 1 commit
    • hugetlbfs: fix races and page leaks during migration · cb6acd01
      Mike Kravetz authored
      hugetlb pages should only be migrated if they are 'active'.  The
      routines set/clear_page_huge_active() modify the active state of hugetlb
      pages.
      When a new hugetlb page is allocated at fault time, set_page_huge_active
      is called before the page is locked.  Therefore, another thread could
      race and migrate the page while it is being added to page table by the
      fault code.  This race is somewhat hard to trigger, but can be seen by
      strategically adding udelay to simulate worst case scheduling behavior.
      Depending on 'how' the code races, various BUG()s could be triggered.
      To address this issue, simply delay the set_page_huge_active call until
      after the page is successfully added to the page table.
      Hugetlb pages can also be leaked at migration time if the pages are
      associated with a file in an explicitly mounted hugetlbfs filesystem.
      For example, consider a two node system with 4GB worth of huge pages
      available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
      then migrates the pages associated with the file from one node to
      another.  When the program exits, huge page counts are as follows:
        node0
        1024    free_hugepages
        1024    nr_hugepages
        node1
        0       free_hugepages
        1024    nr_hugepages
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      That is as expected.  2G of huge pages are taken from the free_hugepages
      counts, and 2G is the size of the file in the explicitly mounted
      filesystem.  If the file is then removed, the counts become:
        node0
        1024    free_hugepages
        1024    nr_hugepages
        node1
        1024    free_hugepages
        1024    nr_hugepages
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      Note that the filesystem still shows 2G of pages used, while there
      actually are no huge pages in use.  The only way to 'fix' the filesystem
      accounting is to unmount the filesystem.
      If a hugetlb page is associated with an explicitly mounted filesystem,
      this information is contained in the page_private field.  At migration
      time, this information is not preserved.  To fix, simply transfer
      page_private from old to new page at migration time if necessary.
      There is a related race with removing a huge page from a file and
      migration.  When a huge page is removed from the pagecache, the
      page_mapping() field is cleared, yet page_private remains set until the
      page is actually freed by free_huge_page().  A page could be migrated
      while in this state.  However, since page_mapping() is not set, the
      hugetlbfs-specific routine to transfer page_private is not called and we
      leak the page count in the filesystem.
      To fix that, check for this condition before migrating a huge page.  If
      the condition is detected, return EBUSY for the page.
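      The two migration fixes described above can be pictured with a small
      userspace model.  This is only a sketch under stated assumptions:
      struct hpage, its fields and migrate_hpage() are hypothetical stand-ins
      and are not the kernel's struct page or the hugetlbfs migration code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a huge page; NOT the kernel's struct page. */
struct hpage {
    void *mapping;       /* models page_mapping(): owning file, if any */
    unsigned long priv;  /* models page_private: filesystem accounting info */
};

/*
 * Model of the migration decision.
 * Returns 0 on success, -EBUSY if the page must not be migrated.
 */
static int migrate_hpage(struct hpage *src, struct hpage *dst)
{
    /*
     * The page was removed from the pagecache but is not freed yet:
     * mapping is already cleared while priv is still set.  Migrating
     * now would lose the accounting info, so refuse.
     */
    if (src->mapping == NULL && src->priv != 0)
        return -EBUSY;

    /* Carry the accounting info over so it is released with the new page. */
    dst->mapping = src->mapping;
    dst->priv = src->priv;
    src->mapping = NULL;
    src->priv = 0;
    return 0;
}
```

      In this model the accounting information always ends up on exactly one
      page, which is the property the leak described above violated.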
      Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
      Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
      Fixes: bcc54222 ("mm: hugetlb: introduce page_huge_active")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      [mike.kravetz@oracle.com: v2]
        Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
      [mike.kravetz@oracle.com: update comment and changelog]
        Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 25 Feb, 2019 2 commits
    • afs: Fix manually set volume location server list · 7d762d69
      David Howells authored
      When a cell with a volume location server list is added manually by
      echoing the details into /proc/net/afs/cells, a record is added but the
      flag saying it has been looked up isn't set.
      This causes the VL server rotation code to wait forever, with the thread
      stuck in afs_start_vl_iteration() waiting for AFS_CELL_FL_NO_LOOKUP_YET to
      be cleared (visible at the top of /proc/pid/stack).
      Fix this by clearing AFS_CELL_FL_NO_LOOKUP_YET when setting up a record
      if that record's details were supplied manually.
      Fixes: 0a5143f2 ("afs: Implement VL server rotation")
      Reported-by: Dave Botsch <dwb7@cornell.edu>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds authored
      This reverts commit 9da3f2b7.
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinkled into random
      places to hide the wrongness).
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in place.
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
      So let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 21 Feb, 2019 1 commit
  4. 20 Feb, 2019 1 commit
  5. 19 Feb, 2019 1 commit
    • exec: load_script: Do not exec truncated interpreter path · b5372fe5
      Kees Cook authored
      Commit 8099b047 ("exec: load_script: don't blindly truncate
      shebang string") was trying to protect against a confused exec of a
      truncated interpreter path. However, it was overeager and also refused
      to truncate arguments as well, which broke userspace, and it was
      reverted. This attempts the protection again, but allows arguments to
      remain truncated. In an effort to improve readability, helper functions
      and comments have been added.
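      The policy described above (refuse a possibly truncated interpreter
      path, tolerate truncated arguments) can be sketched in userspace as
      follows.  This is a simplified illustration, not the kernel's
      load_script(): parse_shebang(), is_blank() and the buffer handling are
      hypothetical names and assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

static int is_blank(char c)
{
    return c == ' ' || c == '\t';
}

/*
 * buf holds the first len bytes of a script (the file may be longer).
 * On success, copies the interpreter path into interp and returns 0.
 * Returns -ENOEXEC if there is no interpreter or if its path may have
 * been cut off by the end of the buffer.
 */
static int parse_shebang(const char *buf, size_t len,
                         char *interp, size_t interp_sz)
{
    const char *nl;
    size_t end, start, stop;
    int truncated;

    if (len < 2 || buf[0] != '#' || buf[1] != '!')
        return -ENOEXEC;

    /* No newline inside the buffer means the line was truncated. */
    nl = memchr(buf + 2, '\n', len - 2);
    truncated = (nl == NULL);
    end = truncated ? len : (size_t)(nl - buf);

    /* Skip blanks between "#!" and the interpreter path. */
    start = 2;
    while (start < end && is_blank(buf[start]))
        start++;
    if (start == end)
        return -ENOEXEC;

    /* The interpreter path runs up to the next blank. */
    stop = start;
    while (stop < end && !is_blank(buf[stop]))
        stop++;

    /*
     * If the line was truncated and the path reaches the end of the
     * buffer, the path itself may be incomplete: refuse to use it.
     * Truncation that only affects the arguments is tolerated.
     */
    if (truncated && stop == end)
        return -ENOEXEC;

    if (stop - start >= interp_sz)
        return -ENOEXEC;
    memcpy(interp, buf + start, stop - start);
    interp[stop - start] = '\0';
    return 0;
}
```

      The key design point is that only the first token decides whether the
      exec is refused; everything after the first blank may be cut short.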
      Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Samuel Dionne-Riel <samuel@dionne-riel.com>
      Cc: Richard Weinberger <richard.weinberger@gmail.com>
      Cc: Graham Christensen <graham@grahamc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 18 Feb, 2019 1 commit
  7. 15 Feb, 2019 1 commit
    • keys: Fix dependency loop between construction record and auth key · 822ad64d
      David Howells authored
      In the request_key() upcall mechanism there's a dependency loop by which if
      a key type driver overrides the ->request_key hook and the userspace side
      manages to lose the authorisation key, the auth key and the internal
      construction record (struct key_construction) can keep each other pinned.
      Fix this by the following changes:
       (1) Killing off the construction record and using the auth key instead.
       (2) Including the operation name in the auth key payload and making the
           payload available outside of security/keys/.
       (3) The ->request_key hook is given the authkey instead of the cons
           record and operation name.
      Changes (2) and (3) allow the auth key to naturally be cleaned up if the
      keyring it is in is destroyed or cleared or the auth key is unlinked.
      Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: James Morris <james.morris@microsoft.com>
  8. 14 Feb, 2019 3 commits
  9. 13 Feb, 2019 2 commits
  10. 12 Feb, 2019 1 commit
  11. 06 Feb, 2019 3 commits
  12. 03 Feb, 2019 3 commits
    • xfs: set buffer ops when repair probes for btree type · add46b3b
      Darrick J. Wong authored
      In xrep_findroot_block, we work out the btree type and correctness of a
      given block by calling different btree verifiers on root block
      candidates.  However, we leave b_ops NULL while ->verify_read
      validates the block, which means that if the verifier calls
      xfs_buf_verifier_error it'll crash on the null b_ops.  Fix it to set
      b_ops before calling the verifier and unset it if the verifier fails.
      Furthermore, improve the documentation around xfs_buf_ensure_ops, which
      is the function that is responsible for cleaning up the b_ops state of
      buffers that go through xrep_findroot_block but don't match anything.
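      The set-then-verify-then-unset pattern can be shown with a toy model.
      Everything here (struct buf, struct buf_ops, report_error(),
      probe_type()) is a hypothetical illustration and not the XFS buffer
      code; the point is only that the error reporter dereferences the ops
      pointer, so it must be set before the verifier runs.

```c
#include <stddef.h>

struct buf;

/* Hypothetical ops table: a name plus a read verifier. */
struct buf_ops {
    const char *name;
    int (*verify_read)(struct buf *bp);
};

struct buf {
    unsigned int magic;
    const struct buf_ops *ops;  /* models b_ops */
};

/* Records which verifier reported the last error. */
static const char *last_error_from;

/* Models an error reporter that dereferences bp->ops. */
static void report_error(struct buf *bp)
{
    last_error_from = bp->ops->name;  /* would crash if ops were NULL */
}

static int verify_type_a(struct buf *bp)
{
    if (bp->magic != 0xA) {
        report_error(bp);
        return -1;
    }
    return 0;
}

static const struct buf_ops type_a_ops = { "type_a", verify_type_a };

/* Try one candidate type: set ops first, clear them again on failure. */
static int probe_type(struct buf *bp, const struct buf_ops *candidate)
{
    int err;

    bp->ops = candidate;
    err = candidate->verify_read(bp);
    if (err)
        bp->ops = NULL;  /* leave the buffer untyped for the next probe */
    return err;
}
```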
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: end sync buffer I/O properly on shutdown error · 465fa17f
      Brian Foster authored
      As of commit e339dd8d ("xfs: use sync buffer I/O for sync delwri
      queue submission"), the delwri submission code uses sync buffer I/O
      for sync delwri I/O. Instead of waiting on async I/O to unlock the
      buffer, it uses the underlying sync I/O completion mechanism.
      If delwri buffer submission fails due to a shutdown scenario, an
      error is set on the buffer and buffer completion never occurs. This
      can cause xfs_buf_delwri_submit() to deadlock waiting on a
      completion event.
      We could check the error state before waiting on such buffers, but
      that doesn't serialize against the case of an error set via a racing
      I/O completion. Instead, invoke I/O completion in the shutdown case
      regardless of buffer I/O type.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: eof trim writeback mapping as soon as it is cached · aa6ee4ab
      Brian Foster authored
      The cached writeback mapping is EOF trimmed to try and avoid races
      between post-eof block management and writeback that result in
      sending cached data to a stale location. The cached mapping is
      currently trimmed on the validation check, which leaves a race
      window between the time the mapping is cached and when it is trimmed
      against the current inode size.
      For example, if a new mapping is cached by delalloc conversion on a
      blocksize == page size fs, we could cycle various locks, perform
      memory allocations, etc.  in the writeback codepath before the
      associated mapping is eventually trimmed to i_size. This leaves
      enough time for a post-eof truncate and file append before the
      cached mapping is trimmed. The former event essentially invalidates
      a range of the cached mapping and the latter bumps the inode size
      such that the trim on the next writepage event won't trim all of the
      invalid blocks. fstest generic/464 reproduces this scenario
      occasionally and causes a lost writeback and stale delalloc blocks
      warning on inode inactivation.
      To work around this problem, trim the cached writeback mapping as
      soon as it is cached in addition to on subsequent validation checks.
      This is a minor tweak to tighten the race window as much as possible
      until a proper invalidation mechanism is available.
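      As a rough illustration of what "EOF trimming" a mapping means, the
      sketch below clamps a block range to the block that contains the end of
      file.  The structure and function names (struct wb_map, trim_to_eof())
      are hypothetical and the arithmetic is a simplification, not the XFS
      implementation.

```c
/* Hypothetical cached mapping: a run of filesystem blocks. */
struct wb_map {
    unsigned long long start_block;
    unsigned long long block_count;
};

/*
 * Trim the mapping so it does not extend past the last block that
 * holds file data, given the inode size in bytes.
 */
static void trim_to_eof(struct wb_map *m, unsigned long long isize,
                        unsigned int block_size)
{
    /* First block entirely beyond EOF (inode size rounded up). */
    unsigned long long eof_block = (isize + block_size - 1) / block_size;

    if (m->start_block >= eof_block)
        m->block_count = 0;  /* whole mapping is past EOF */
    else if (m->start_block + m->block_count > eof_block)
        m->block_count = eof_block - m->start_block;
}
```

      Doing this once at caching time and again at validation time narrows,
      but does not close, the window in which the inode size can change.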
      Fixes: 40214d12 ("xfs: trim writepage mapping to within eof")
      Cc: <stable@vger.kernel.org> # v4.14+
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  13. 01 Feb, 2019 5 commits
  14. 31 Jan, 2019 3 commits
  15. 30 Jan, 2019 6 commits
    • fs/dcache: Track & report number of negative dentries · af0c9af1
      Waiman Long authored
      The current dentry number tracking code doesn't distinguish between
      positive & negative dentries.  It just reports the total number of
      dentries in the LRU lists.
      As an excessive number of negative dentries can have an impact on system
      performance, it is wise to track the number of positive and
      negative dentries separately.
      This patch adds tracking for the total number of negative dentries in
      the system LRU lists and reports it in the 5th field in the
      /proc/sys/fs/dentry-state file.  The number, however, does not include
      negative dentries that are in flight but not in the LRU yet as well as
      those in the shrinker lists which are on the way out anyway.
      The number of positive dentries in the LRU lists can be roughly found by
      subtracting the number of negative dentries from the unused count.
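      A reader of /proc/sys/fs/dentry-state could derive the rough positive
      count as sketched below.  The parsing helper and structure are
      illustrative only; the field names follow the order described in this
      commit message (negative count in the 5th field) and are otherwise an
      assumption of this example.

```c
#include <stdio.h>

/* The six whitespace-separated values in the dentry-state file. */
struct dentry_state {
    long nr_dentry;
    long nr_unused;
    long age_limit;
    long want_pages;
    long nr_negative;  /* 5th field, added by this patch */
    long dummy;
};

/* Parse one line of dentry-state text; returns 0 on success. */
static int parse_dentry_state(const char *line, struct dentry_state *st)
{
    int n = sscanf(line, "%ld %ld %ld %ld %ld %ld",
                   &st->nr_dentry, &st->nr_unused, &st->age_limit,
                   &st->want_pages, &st->nr_negative, &st->dummy);

    return n == 6 ? 0 : -1;
}

/* Rough number of positive dentries on the LRU lists. */
static long positive_unused(const struct dentry_state *st)
{
    return st->nr_unused - st->nr_negative;
}
```

      On older kernels the 5th field simply reads as zero, so the result
      degrades to the plain unused count.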
      Matthew Wilcox has confirmed that since the introduction of the
      dentry_stat structure in 2.1.60, the dummy array has been there, probably
      for future extension.  Its entries were not replacements of pre-existing
      fields.  So no sane application that reads the value of
      /proc/sys/fs/dentry-state will do anything dumb if the last 2 fields of
      the sysctl parameter are not zero.  IOW, it is safe to use one of the
      dummy array entries for the negative dentry count.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb() · 1dbd449c
      Waiman Long authored
      The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
      lists and the shrink lists where the DCACHE_LRU_LIST bit is set.
      The shrink_dcache_sb() function moves dentries from the LRU list to a
      shrink list and subtracts the dentry count from nr_dentry_unused.  This
      is incorrect as the nr_dentry_unused count will also be decremented in
      shrink_dentry_list() via d_shrink_del().
      To fix this double decrement, the decrement in the shrink_dcache_sb()
      function is taken out.
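      The accounting rule can be modelled with three counters.  This is a toy
      sketch with hypothetical names (struct dcache_model,
      move_lru_to_shrink(), shrink_one()), not the dcache code: the counter
      covers both lists, so it must only drop when a dentry leaves the shrink
      list.

```c
/* Toy model of the unused-dentry accounting. */
struct dcache_model {
    long on_lru;     /* dentries on the LRU list */
    long on_shrink;  /* dentries on a shrink list */
    long nr_unused;  /* models nr_dentry_unused: counts both lists */
};

/* Move up to n dentries from the LRU list to the shrink list. */
static void move_lru_to_shrink(struct dcache_model *m, long n)
{
    if (n > m->on_lru)
        n = m->on_lru;
    m->on_lru -= n;
    m->on_shrink += n;
    /* No nr_unused change here: the dentries are still counted. */
}

/* Dispose of one dentry from the shrink list. */
static void shrink_one(struct dcache_model *m)
{
    if (m->on_shrink == 0)
        return;
    m->on_shrink--;
    m->nr_unused--;  /* the single place where the count drops */
}
```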
      Fixes: 4e717f5c ("list_lru: remove special case function list_lru_dispose_all.")
      Cc: stable@kernel.org
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • btrfs: On error always free subvol_name in btrfs_mount · 532b618b
      Eric W. Biederman authored
      The subvol_name is allocated in btrfs_parse_subvol_options and is
      consumed and freed in mount_subvol.  Add a free to the error paths that
      don't call mount_subvol so that it is guaranteed that subvol_name is
      freed when an error happens.
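      The ownership rule (the consumer frees the name, so every error path
      that never reaches the consumer must free it itself) looks like this in
      a userspace sketch.  All names here (dup_name(), consume_name(),
      do_mount(), live_allocs) are hypothetical; the allocation counter exists
      only so the example can check for leaks.

```c
#include <stdlib.h>
#include <string.h>

/* Number of names currently allocated; lets the example detect leaks. */
static int live_allocs;

static char *dup_name(const char *s)
{
    char *p = malloc(strlen(s) + 1);

    if (p) {
        strcpy(p, s);
        live_allocs++;
    }
    return p;
}

static void free_name(char *p)
{
    if (p) {
        free(p);
        live_allocs--;
    }
}

/* Takes ownership of name and frees it, like the consumer described above. */
static int consume_name(char *name)
{
    free_name(name);
    return 0;
}

static int do_mount(const char *opt, int fail_before_consume)
{
    char *name = dup_name(opt);

    if (!name)
        return -1;

    if (fail_before_consume) {
        /* Error path that never reaches the consumer: free here. */
        free_name(name);
        return -1;
    }
    return consume_name(name);
}
```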
      Fixes: 312c89fb ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
      Cc: stable@vger.kernel.org # v4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: clean up pending block groups when transaction commit aborts · c7cc64a9
      David Sterba authored
      The fstests generic/475 stresses transaction aborts and can reveal
      space accounting or use-after-free bugs regarding block groups.
      In this case the pending block groups remain linked to the
      structures after the transaction commit aborts in the middle.
      The corrupted slabs lead to failures in following tests, e.g. generic/476:
        [ 8172.752887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
        [ 8172.755799] #PF error: [normal kernel read fault]
        [ 8172.757571] PGD 661ae067 P4D 661ae067 PUD 3db8e067 PMD 0
        [ 8172.759000] Oops: 0000 [#1] PREEMPT SMP
        [ 8172.760209] CPU: 0 PID: 39 Comm: kswapd0 Tainted: G        W         5.0.0-rc2-default #408
        [ 8172.762495] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [ 8172.765772] RIP: 0010:shrink_page_list+0x2f9/0xe90
        [ 8172.770453] RSP: 0018:ffff967f00663b18 EFLAGS: 00010287
        [ 8172.771184] RAX: 0000000000000000 RBX: ffff967f00663c20 RCX: 0000000000000000
        [ 8172.772850] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8c0620ab20e0
        [ 8172.774629] RBP: ffff967f00663dd8 R08: 0000000000000000 R09: 0000000000000000
        [ 8172.776094] R10: ffff8c0620ab22f8 R11: ffff8c063f772688 R12: ffff967f00663b78
        [ 8172.777533] R13: ffff8c063f625600 R14: ffff8c063f625608 R15: dead000000000200
        [ 8172.778886] FS:  0000000000000000(0000) GS:ffff8c063d400000(0000) knlGS:0000000000000000
        [ 8172.780545] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8172.781787] CR2: 0000000000000058 CR3: 000000004e962000 CR4: 00000000000006f0
        [ 8172.783547] Call Trace:
        [ 8172.784112]  shrink_inactive_list+0x194/0x410
        [ 8172.784747]  shrink_node_memcg.constprop.85+0x3a5/0x6a0
        [ 8172.785472]  shrink_node+0x62/0x1e0
        [ 8172.786011]  balance_pgdat+0x216/0x460
        [ 8172.786577]  kswapd+0xe3/0x4a0
        [ 8172.787085]  ? finish_wait+0x80/0x80
        [ 8172.787795]  ? balance_pgdat+0x460/0x460
        [ 8172.788799]  kthread+0x116/0x130
        [ 8172.789640]  ? kthread_create_on_node+0x60/0x60
        [ 8172.790323]  ret_from_fork+0x24/0x30
        [ 8172.794253] CR2: 0000000000000058
      or accounting errors at umount time:
        [ 8159.537251] WARNING: CPU: 2 PID: 19031 at fs/btrfs/extent-tree.c:5987 btrfs_free_block_groups+0x3d5/0x410 [btrfs]
        [ 8159.543325] CPU: 2 PID: 19031 Comm: umount Tainted: G        W         5.0.0-rc2-default #408
        [ 8159.545472] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [ 8159.548155] RIP: 0010:btrfs_free_block_groups+0x3d5/0x410 [btrfs]
        [ 8159.554030] RSP: 0018:ffff967f079cbde8 EFLAGS: 00010206
        [ 8159.555144] RAX: 0000000001000000 RBX: ffff8c06366cf800 RCX: 0000000000000000
        [ 8159.556730] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8c06255ad800
        [ 8159.558279] RBP: ffff8c0637ac0000 R08: 0000000000000001 R09: 0000000000000000
        [ 8159.559797] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8c0637ac0108
        [ 8159.561296] R13: ffff8c0637ac0158 R14: 0000000000000000 R15: dead000000000100
        [ 8159.562852] FS:  00007f7f693b9fc0(0000) GS:ffff8c063d800000(0000) knlGS:0000000000000000
        [ 8159.564839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8159.566160] CR2: 00007f7f68fab7b0 CR3: 000000000aec7000 CR4: 00000000000006e0
        [ 8159.567898] Call Trace:
        [ 8159.568597]  close_ctree+0x17f/0x350 [btrfs]
        [ 8159.569628]  generic_shutdown_super+0x64/0x100
        [ 8159.570808]  kill_anon_super+0x14/0x30
        [ 8159.571857]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [ 8159.573063]  deactivate_locked_super+0x29/0x60
        [ 8159.574234]  cleanup_mnt+0x3b/0x70
        [ 8159.575176]  task_work_run+0x98/0xc0
        [ 8159.576177]  exit_to_usermode_loop+0x83/0x90
        [ 8159.577315]  do_syscall_64+0x15b/0x180
        [ 8159.578339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      This fix is based on two of Josef's patches that used side effects of
      btrfs_create_pending_block_groups; this fix introduces a helper that
      does what we need.
      CC: stable@vger.kernel.org # 4.4+
      CC: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix potential oops in device_list_add · 92900e51
      Al Viro authored
      alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its
      result before the check for IS_ERR() is a bad idea.
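      The error-pointer convention behind this fix can be reproduced in
      userspace.  The helpers below are simplified re-creations written for
      this example (the kernel provides its own ERR_PTR(), PTR_ERR() and
      IS_ERR()), and struct fs_devices, alloc_devices() and add_device() are
      hypothetical stand-ins for the btrfs code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for the structure being allocated. */
struct fs_devices {
    int fsid_change;
};

/* Returns a valid pointer or ERR_PTR(-ENOMEM), never NULL. */
static struct fs_devices *alloc_devices(int simulate_oom)
{
    struct fs_devices *d;

    if (simulate_oom)
        return ERR_PTR(-ENOMEM);
    d = calloc(1, sizeof(*d));
    return d ? d : ERR_PTR(-ENOMEM);
}

static long add_device(int simulate_oom, struct fs_devices **out)
{
    struct fs_devices *d = alloc_devices(simulate_oom);

    /* Check for an error pointer BEFORE touching any member. */
    if (IS_ERR(d))
        return PTR_ERR(d);

    d->fsid_change = 1;
    *out = d;
    return 0;
}
```

      Swapping the member write and the IS_ERR() check would dereference an
      address near the top of the address space, which is the oops the commit
      avoids.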
      Fixes: d1a63002 ("btrfs: add members to fs_devices to track fsid changes")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • debugfs: debugfs_lookup() should return NULL if not found · 37ea7b63
      Greg Kroah-Hartman authored
      Lots of callers of debugfs_lookup() were just checking NULL to see if
      the file/directory was found or not.  By changing this in ff9fb72b
      ("debugfs: return error values, not NULL") we caused some subsystems to
      easily crash.
      Fixes: ff9fb72b ("debugfs: return error values, not NULL")
      Reported-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  16. 29 Jan, 2019 6 commits