1. 28 Dec, 2018 3 commits
    • hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race · c86aa7bb
      Mike Kravetz authored
      hugetlbfs page faults can race with truncate and hole punch operations.
      Current code in the page fault path attempts to handle this by 'backing
      out' operations if we encounter the race.  One obvious omission in the
      current code is removing a page newly added to the page cache.  This is
      pretty straightforward to address, but there is a more subtle and
      difficult issue of backing out hugetlb reservations.  To handle this
      correctly, the 'reservation state' before page allocation needs to be
      noted so that it can be properly backed out.  There are four distinct
      possibilities for reservation state: shared/reserved, shared/no-resv,
      private/reserved and private/no-resv.  Backing out a reservation may
      require memory allocation which could fail so that needs to be taken into
      account as well.
      
      Instead of writing the required complicated code for this rare occurrence,
      just eliminate the race.  i_mmap_rwsem is now held in read mode for the
      duration of page fault processing.  Hold i_mmap_rwsem longer in truncation
      and hole punch code to cover the call to remove_inode_hugepages.
      
      With this modification, code in remove_inode_hugepages checking for races
      becomes 'dead' as it can no longer happen.  Remove the dead code and
      expand comments to explain reasoning.  Similarly, checks for races with
      truncation in the page fault path can be simplified and removed.
      
      [mike.kravetz@oracle.com: incorporate suggestions from Kirill]
        Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
      Fixes: ebed4bfc ("hugetlb: fix absurd HugePages_Rsvd")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · b43a9990
      Mike Kravetz authored
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with the
        ptep.
      
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
        called.
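
      The read/write rule above can be illustrated with a toy, single-threaded
      model.  This is only a sketch of reader/writer exclusion, not kernel code:
      struct toy_rwsem and the toy_* helpers are hypothetical names, and the
      "trylock" functions merely report whether acquisition would be allowed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a reader/writer semaphore (illustrative, not a real lock). */
struct toy_rwsem {
	int readers;	/* number of active read-mode holders */
	bool writer;	/* true while held in write mode */
};

/* Read mode: allowed unless a writer holds the semaphore. */
static bool toy_down_read_trylock(struct toy_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *s)
{
	assert(s->readers > 0);
	s->readers--;
}

/* Write mode: allowed only when there are no readers and no writer. */
static bool toy_down_write_trylock(struct toy_rwsem *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *s)
{
	assert(s->writer);
	s->writer = false;
}
```

      In this model, a path that still holds the semaphore in read mode (the
      fault path using its ptep) keeps a write-mode path (the one that would
      unshare the pmd) from proceeding until the reader is done.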
      
      [mike.kravetz@oracle.com: add explicit check for mapping != null]
      Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 · ac46d4f3
      Jérôme Glisse authored
      To avoid having to change many call sites every time we want to add a
      parameter, use a structure to group all parameters for the mmu_notifier
      invalidate_range_start/end calls.  No functional changes with this patch.
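
      A minimal sketch of the general technique, assuming nothing about the
      actual structure this patch introduces: struct toy_range, its fields and
      toy_range_len() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parameter bundle; the field names are illustrative only. */
struct toy_range {
	unsigned long start;
	unsigned long end;
	bool blockable;	/* a later addition: no callee signature changes */
};

/* The callee takes one pointer instead of a growing argument list. */
static unsigned long toy_range_len(const struct toy_range *range)
{
	assert(range != NULL);
	return range->end - range->start;
}
```

      Adding another field to the structure leaves every existing caller of
      toy_range_len() untouched, which is the point of grouping the parameters.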
      
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      From: Jérôme Glisse <jglisse@redhat.com>
      Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
      
      fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
      
      Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 14 Dec, 2018 1 commit
  3. 30 Nov, 2018 1 commit
    • userfaultfd: use ENOENT instead of EFAULT if the atomic copy user fails · 9e368259
      Andrea Arcangeli authored
      Patch series "userfaultfd shmem updates".
      
      Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
      lack of the VM_MAYWRITE check and the lack of i_size checks.
      
      Then looking into the above we also fixed the MAP_PRIVATE case.
      
      Hugh by source review also found a data loss source if UFFDIO_COPY is
      used on shmem MAP_SHARED PROT_READ mappings (the production usages
      incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
      happen in those production usages like with QEMU).
      
      The whole patchset is marked for stable.
      
      We verified that QEMU postcopy live migration, with the guest running on
      shmem MAP_PRIVATE, runs as well after the shmem MAP_PRIVATE fix as it did
      before.  Regardless of whether the backend is shmem or hugetlbfs,
      MAP_PRIVATE or MAP_SHARED, QEMU unconditionally invokes a hole punch if
      the guest mapping is file-backed, and a MADV_DONTNEED too (needed to get
      rid of the MAP_PRIVATE COWs and for the anon backend).
      
      This patch (of 5):
      
      We internally used EFAULT to communicate with the caller.  Switch to
      ENOENT so that EFAULT can be used as a non-internal return value.
      
      Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
      Fixes: 4c27fe4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 18 Nov, 2018 1 commit
    • hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! · 5e41540c
      Mike Kravetz authored
      This bug has been experienced several times by the Oracle DB team.  The
      BUG is in remove_inode_hugepages() as follows:
      
      	/*
      	 * If page is mapped, it was faulted in after being
      	 * unmapped in caller.  Unmap (again) now after taking
      	 * the fault mutex.  The mutex will prevent faults
      	 * until we finish removing the page.
      	 *
      	 * This race can only happen in the hole punch case.
      	 * Getting here in a truncate operation is a bug.
      	 */
      	if (unlikely(page_mapped(page))) {
      		BUG_ON(truncate_op);
      
      In this case, the elevated map count is not the result of a race.
      Rather it was incorrectly incremented as the result of a bug in the huge
      pmd sharing code.  Consider the following:
      
       - Process A maps a hugetlbfs file of sufficient size and alignment
         (PUD_SIZE) that a pmd page could be shared.
      
       - Process B maps the same hugetlbfs file with the same size and
         alignment such that a pmd page is shared.
      
       - Process B then calls mprotect() to change protections for the mapping
         with the shared pmd. As a result, the pmd is 'unshared'.
      
       - Process B then calls mprotect() again to change protections for the
         mapping back to their original value. pmd remains unshared.
      
       - Process B then forks and process C is created. During the fork
         process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
         tables. Copying page tables for hugetlb mappings is done in the
         routine copy_hugetlb_page_range.
      
      In copy_hugetlb_page_range(), the destination pte is obtained by:
      
      	dst_pte = huge_pte_alloc(dst, addr, sz);
      
      If pmd sharing is possible, the returned pointer will be to a pte in an
      existing page table.  In the situation above, process C could share with
      either process A or process B.  Since process A is first in the list,
      the returned pte is a pointer to a pte in process A's page table.
      
      However, the check for pmd sharing in copy_hugetlb_page_range is:
      
      	/* If the pagetables are shared don't copy or take references */
      	if (dst_pte == src_pte)
      		continue;
      
      Since process C is sharing with process A instead of process B, the
      above test fails.  The code in copy_hugetlb_page_range which follows
      assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
      from src_pte to dst_pte and increments the map count of the associated
      page.  This is how we end up with an elevated map count.
      
      To solve, check the dst_pte entry for huge_pte_none.  If !none, this
      implies PMD sharing so do not copy.
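
      The effect of the extra check can be shown with a toy userspace model.
      This is a sketch under stated assumptions, not the kernel code: a "pte"
      is modelled as a plain word where 0 means huge_pte_none, and toy_pte_t,
      struct toy_page and the toy_*/should_copy_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a "pte" is just a word; 0 stands for huge_pte_none. */
typedef unsigned long toy_pte_t;

struct toy_page {
	int mapcount;
};

/* Old check: skip only when source and destination are the same slot. */
static bool should_copy_old(const toy_pte_t *dst, const toy_pte_t *src)
{
	return dst != src;
}

/*
 * Check described above: also skip when the destination slot is already
 * populated, which implies it lives in a shared page table.
 */
static bool should_copy_new(const toy_pte_t *dst, const toy_pte_t *src)
{
	return dst != src && *dst == 0;
}

/* Copy one entry and account for the new mapping if the check allows it. */
static void toy_copy(toy_pte_t *dst, const toy_pte_t *src,
		     struct toy_page *page,
		     bool (*should_copy)(const toy_pte_t *, const toy_pte_t *))
{
	assert(dst != NULL && src != NULL && page != NULL);
	if (!should_copy(dst, src))
		return;
	*dst = *src;
	page->mapcount++;
}
```

      With a destination slot that belongs to process A's already populated
      table and a source slot from process B, the old check copies the entry
      and bumps the map count, while the new check leaves both untouched.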
      
      Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
      Fixes: c5c99429 ("fix hugepages leak due to pagetable page sharing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 31 Oct, 2018 3 commits
    • mm: remove include/linux/bootmem.h · 57c8a661
      Mike Rapoport authored
      Move remaining definitions and declarations from include/linux/bootmem.h
      into include/linux/memblock.h and remove the redundant header.
      
      The includes were replaced with the semantic patch below, followed by
      semi-automated removal of duplicated '#include <linux/memblock.h>'
      
      @@
      @@
      - #include <linux/bootmem.h>
      + #include <linux/memblock.h>
      
      [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
      [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
      [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
        Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants · 97ad1087
      Mike Rapoport authored
      Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of
      identical MEMBLOCK definitions.
      
      Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: remove _virt from APIs returning virtual address · eb31d559
      Mike Rapoport authored
      The conversion is done using
      
      sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
      	$(git grep -l memblock_virt_alloc)
      
      Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 26 Oct, 2018 1 commit
    • hugetlbfs: dirty pages as they are added to pagecache · 22146c3c
      Mike Kravetz authored
      Some test systems were experiencing negative huge page reserve counts and
      incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
      removing clean pages from hugetlbfs file pagecaches.  When code that is
      not hugetlbfs-specific removes the pages, the appropriate accounting is
      not performed.
      
      This can be recreated as follows:
       fallocate -l 2M /dev/hugepages/foo
       echo 1 > /proc/sys/vm/drop_caches
       fallocate -l 2M /dev/hugepages/foo
       grep -i huge /proc/meminfo
         AnonHugePages:         0 kB
         ShmemHugePages:        0 kB
         HugePages_Total:    2048
         HugePages_Free:     2047
         HugePages_Rsvd:    18446744073709551615
         HugePages_Surp:        0
         Hugepagesize:       2048 kB
         Hugetlb:         4194304 kB
       ls -lsh /dev/hugepages/foo
         4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
      
      To address this issue, dirty pages as they are added to pagecache.  This
      can easily be reproduced with fallocate as shown above.  Read faulted
      pages will eventually end up being marked dirty.  But there is a window
      where they are clean and could be impacted by code such as drop_caches.
      So, just dirty them all as they are added to the pagecache.
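
      The HugePages_Rsvd value in the output above equals 2^64 - 1, which is
      what an unsigned 64-bit counter holds after being decremented once below
      zero.  A minimal userspace sketch, assuming such a counter (struct
      toy_resv and toy_resv_dec() are hypothetical names, not kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an unsigned 64-bit reserve counter. */
struct toy_resv {
	uint64_t resv_huge_pages;
};

/* Unchecked decrement: wraps to 2^64 - 1 if the counter is already 0. */
static void toy_resv_dec(struct toy_resv *r)
{
	assert(r != NULL);
	r->resv_huge_pages--;
}
```

      Starting from zero, one call to toy_resv_dec() leaves the counter at
      18446744073709551615, the "negative" reserve count seen in /proc/meminfo.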
      
      Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
      Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 05 Oct, 2018 2 commits
  8. 24 Aug, 2018 2 commits
    • mm: Change return type int to vm_fault_t for fault handlers · 2b740303
      Souptick Joarder authored
      Use new return type vm_fault_t for fault handler.  For now, this is just
      documenting that the function returns a VM_FAULT value rather than an
      errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      The aim is to change the return type of finish_fault() and
      handle_mm_fault() to vm_fault_t.  As part of that cleanup, the return
      types of all other recursively called functions have been changed to
      vm_fault_t.
      
      The places from which handle_mm_fault() is invoked will be changed to
      vm_fault_t as well, but in a separate patch.
      
      vmf_error() is the inline function newly introduced in 4.17-rc6.
      
      [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
      Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix race on soft-offlining free huge pages · 6bc9b564
      Naoya Horiguchi authored
      Patch series "mm: soft-offline: fix race against page allocation".
      
      Xishi recently reported a race on reuse of the target pages of soft
      offlining.  Discussion and analysis showed that we need to make sure
      that PG_hwpoison is set in the right place, under zone->lock, for soft
      offline.  1/2 handles the free hugepage case, and 2/2 handles the free
      buddy page case.
      
      This patch (of 2):
      
      There's a race condition between soft offline and hugetlb_fault which
      causes unexpected process killing and/or hugetlb allocation failure.
      
      The process killing is caused by the following flow:
      
        CPU 0               CPU 1              CPU 2
      
        soft offline
          get_any_page
          // find the hugetlb is free
                            mmap a hugetlb file
                            page fault
                              ...
                                hugetlb_fault
                                  hugetlb_no_page
                                    alloc_huge_page
                                    // succeed
            soft_offline_free_page
            // set hwpoison flag
                                               mmap the hugetlb file
                                               page fault
                                                 ...
                                                   hugetlb_fault
                                                     hugetlb_no_page
                                                       find_lock_page
                                                         return VM_FAULT_HWPOISON
                                                 mm_fault_error
                                                   do_sigbus
                                                   // kill the process
      
      The hugetlb allocation failure comes from the following flow:
      
        CPU 0                          CPU 1
      
                                       mmap a hugetlb file
                                       // reserve all free page but don't fault-in
        soft offline
          get_any_page
          // find the hugetlb is free
            soft_offline_free_page
            // set hwpoison flag
              dissolve_free_huge_page
              // fail because all free hugepages are reserved
                                       page fault
                                         ...
                                           hugetlb_fault
                                             hugetlb_no_page
                                               alloc_huge_page
                                                 ...
                                                   dequeue_huge_page_node_exact
                                                   // ignore hwpoisoned hugepage
                                                   // and finally fail due to no-mem
      
      The root cause is that the current soft-offline code is written on the
      assumption that the PageHWPoison flag should be set first to avoid
      accessing the corrupted data.  This makes sense for memory_failure() or
      hard offline, but not for soft offline, because soft offline deals with
      corrected (not uncorrected) errors and is safe from data loss.  This patch
      changes soft offline semantics so that it sets the PageHWPoison flag only
      after containment of the error page completes successfully.
      
      Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zy.zhengyi@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 17 Aug, 2018 4 commits
  10. 02 Aug, 2018 1 commit
  11. 04 Jul, 2018 1 commit
  12. 12 Jun, 2018 1 commit
    • treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook authored
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
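
      The benefit of passing the two factors separately is that the product
      can be checked for overflow before anything is allocated.  A minimal
      userspace sketch of that idea, assuming malloc() as the backing allocator
      (toy_alloc_array() is a hypothetical name, not the kernel implementation):

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of a two-factor allocator: refuse the
 * request when n * size would overflow size_t, instead of silently
 * allocating a buffer that is too small.
 */
static void *toy_alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the product would wrap around */
	return malloc(n * size);
}
```

      A caller that writes toy_alloc_array(count, sizeof(*p)) gets NULL for an
      overflowing request, whereas malloc(count * sizeof(*p)) would receive an
      already wrapped size.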
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
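The point of the conversions above is that the multiplication moves out of the caller and into a helper that can check it. A minimal userspace sketch of that idea follows; `demo_alloc_array` and `demo_array3_size` are hypothetical stand-ins built on `malloc`, not the kernel implementations of kmalloc_array() and array3_size().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for kmalloc_array(): refuse the
 * allocation when count * size would overflow size_t. */
static void *demo_alloc_array(size_t count, size_t size)
{
	if (size != 0 && count > SIZE_MAX / size)
		return NULL;	/* the product would wrap around */
	return malloc(count * size);
}

/* Hypothetical stand-in for array3_size(): saturate to SIZE_MAX on
 * overflow so that a following allocation is certain to fail. */
static size_t demo_array3_size(size_t a, size_t b, size_t c)
{
	if (b != 0 && a > SIZE_MAX / b)
		return SIZE_MAX;
	if (c != 0 && a * b > SIZE_MAX / c)
		return SIZE_MAX;
	return a * b * c;
}
```

With an open-coded `malloc(count * size)` a wrapped product silently yields a too-small buffer; with the helper the caller only has to handle a NULL return.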
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6da2ec56
  13. 08 Jun, 2018 2 commits
    • Huang Ying's avatar
      mm, hugetlbfs: pass fault address to no page handler · 285b8dca
      Huang Ying authored
      This is to take better advantage of general huge page clearing
      optimization (commit c79b57e4: "mm: hugetlb: clear target sub-page
      last when clearing huge page") for hugetlbfs.
      
      In the general optimization patch, the sub-page to be accessed is
      cleared last, so that its cache lines are not evicted while the other
      sub-pages are being cleared.  This works better if we have the
      address of the sub-page to access, that is, the fault address inside the
      huge page.  So the hugetlbfs no page fault handler is changed to pass
      that information.  This will benefit workloads which don't access the
      beginning of the hugetlbfs huge page after the page fault under heavy
      cache contention for shared last level cache.
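The ordering idea can be sketched in userspace as follows. This is a simplified illustration, not the kernel's clearing routine: `demo_clear_huge_page` and `DEMO_SUBPAGE_SIZE` are hypothetical names, and the sketch only shows "everything else first, target last".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SUBPAGE_SIZE 64	/* illustrative, not a real page size */

/* Clear every sub-page except the one containing the fault offset, then
 * clear that target sub-page last so its cache lines are the most
 * recently touched.  Returns the index of the target sub-page. */
static size_t demo_clear_huge_page(unsigned char *huge, size_t nr_subpages,
				   size_t fault_offset)
{
	size_t target = fault_offset / DEMO_SUBPAGE_SIZE;
	size_t i;

	assert(target < nr_subpages);
	for (i = 0; i < nr_subpages; i++) {
		if (i != target)
			memset(huge + i * DEMO_SUBPAGE_SIZE, 0,
			       DEMO_SUBPAGE_SIZE);
	}
	memset(huge + target * DEMO_SUBPAGE_SIZE, 0, DEMO_SUBPAGE_SIZE);
	return target;
}
```

Without the fault address the handler could only guess which sub-page will be touched first, which is why the address has to be passed down.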
      
      The patch is a generic optimization which should benefit a fair number
      of workloads, not just a specific use case.  To demonstrate the performance
      benefit of the patch, we tested it with vm-scalability run on hugetlbfs.
      
      With this patch, the throughput increases ~28.1% in vm-scalability
      anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
      system (44 cores, 88 threads).  The test case creates 88 processes, each
      process mmaps a big anonymous memory area with MAP_HUGETLB and writes to
      it from the end to the beginning.  For each process, other processes could
      be seen as other workload which generates heavy cache pressure.  At the
      same time, the cache miss rate dropped from ~36.3% to ~25.6%, the IPC
      (instructions per cycle) increased from 0.3 to 0.37, and the time spent
      in user space was reduced by ~19.3%.
      
      Link: http://lkml.kernel.org/r/20180517083539.9242-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      285b8dca
    • Souptick Joarder's avatar
      mm: change return type to vm_fault_t · b3ec9f33
      Souptick Joarder authored
      Use new return type vm_fault_t for fault handler in struct
      vm_operations_struct.  For now, this is just documenting that the
      function returns a VM_FAULT value rather than an errno.  Once all
      instances are converted, vm_fault_t will become a distinct type.
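As a rough illustration of why a dedicated return type helps, the sketch below keeps errno values and fault codes apart and converts between them at one explicit boundary. The type `demo_vm_fault_t`, the `DEMO_VM_FAULT_*` values and `demo_fault_from_errno` are assumptions made for this example; the kernel defines its own vm_fault_t and VM_FAULT codes.

```c
#include <errno.h>

/* Illustrative stand-ins, not the kernel definitions. */
typedef unsigned int demo_vm_fault_t;
#define DEMO_VM_FAULT_OOM	0x1u
#define DEMO_VM_FAULT_SIGBUS	0x2u

/* Translate an errno-style result into a fault code, so that a handler
 * declared as returning demo_vm_fault_t never leaks a negative errno. */
static demo_vm_fault_t demo_fault_from_errno(int err)
{
	if (err == 0)
		return 0;
	if (err == -ENOMEM)
		return DEMO_VM_FAULT_OOM;
	return DEMO_VM_FAULT_SIGBUS;
}
```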
      
      See commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3ec9f33
  14. 16 Apr, 2018 1 commit
  15. 06 Apr, 2018 2 commits
  16. 23 Mar, 2018 1 commit
  17. 10 Mar, 2018 1 commit
  18. 01 Feb, 2018 9 commits
    • Michal Hocko's avatar
      hugetlb, mbind: fall back to default policy if vma is NULL · 389c8178
      Michal Hocko authored
      Dan Carpenter has noticed that the mbind migration callback (new_page) can
      get a NULL vma pointer and choke on it inside alloc_huge_page_vma which
      relies on the VMA to get the hstate.  We used to BUG_ON this case but
      the BUG_ON has been removed recently by "hugetlb, mempolicy: fix the
      mbind hugetlb migration".
      
      The proper way to handle this is to get the hstate from the migrated
      page and rely on huge_node (resp.  get_vma_policy) to do the right thing
      with a NULL VMA.  We are currently falling back to the default mempolicy
      in that case, which is in line with what the THP path is doing here.
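The two halves of the fix can be sketched with toy types: the page size is derived from the page being migrated, and the policy lookup tolerates a missing VMA. All `demo_*` names below are hypothetical and only model the control flow, not the kernel data structures.

```c
#include <stddef.h>

struct demo_policy { int id; };
struct demo_vma { struct demo_policy *policy; };
struct demo_page { unsigned int order; };

static struct demo_policy demo_default_policy = { 0 };

/* The huge page order comes from the page itself, so no VMA is needed. */
static unsigned int demo_page_order(const struct demo_page *page)
{
	return page->order;
}

/* Fall back to the default policy when there is no VMA, or when the VMA
 * carries no policy of its own. */
static struct demo_policy *demo_get_policy(const struct demo_vma *vma)
{
	if (vma && vma->policy)
		return vma->policy;
	return &demo_default_policy;
}
```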
      
      Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      389c8178
    • Michal Hocko's avatar
      hugetlb, mempolicy: fix the mbind hugetlb migration · ebd63723
      Michal Hocko authored
      The do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
      pages.  alloc_huge_page_noerr uses alloc_huge_page which is a high-level
      allocation function which has to take care of reserves, overcommit or
      hugetlb cgroup accounting.  None of that is really required for the page
      migration because the new page is only temporary and will either replace
      the original page or be dropped.  This is essentially the same as for
      other migration call paths and there shouldn't be any reason to handle
      mbind in a special way.
      
      The current implementation is even suboptimal because the migration
      might fail just because the hugetlb cgroup limit is reached, or the
      overcommit is saturated.
      
      Fix this by making mbind like other hugetlb migration paths.  Add a new
      migration helper alloc_huge_page_vma as a wrapper around
      alloc_huge_page_nodemask with additional mempolicy handling.
      
      alloc_huge_page_noerr has no more users and it can go.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebd63723
    • Michal Hocko's avatar
      mm, hugetlb: further simplify hugetlb allocation API · 0c397dae
      Michal Hocko authored
      The hugetlb allocator has several layers of allocation functions depending
      on the purpose of the allocation.  There are two allocators depending
      on whether the page can be allocated from the page allocator or we need
      a contiguous allocator.  This is currently open-coded in
      alloc_fresh_huge_page which is the only path that might allocate giga
      pages which require the latter allocator.  Create alloc_fresh_huge_page
      which hides this implementation detail and use it in all callers which
      hardcoded the buddy allocator path (__hugetlb_alloc_buddy_huge_page).
      This shouldn't introduce any functional change because both migration and
      surplus allocators exclude giga pages explicitly.
      
      While we are at it let's do some renaming.  The current scheme is not
      consistent and overly painful to read and understand.  Get rid of
      prefix underscores from most functions.  There is no real reason to make
      names longer.
      
      * alloc_fresh_huge_page is the new layer to abstract underlying
        allocator
      * __hugetlb_alloc_buddy_huge_page becomes the shorter and neater
        alloc_buddy_huge_page.
      * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
        the new page directly into the pool
      * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
        as it uses alloc_fresh_huge_page now
      * others lose their excessive prefix underscores to make names shorter
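The layering described above boils down to one function that picks the backing allocator so callers do not have to. A minimal sketch, assuming a hypothetical order threshold `DEMO_MAX_ORDER` and toy `demo_*` types rather than the kernel's hstate:

```c
/* Which underlying allocator would serve the request. */
enum demo_backend { DEMO_BUDDY, DEMO_CONTIG };

struct demo_hstate { unsigned int order; };

#define DEMO_MAX_ORDER 11	/* illustrative buddy allocator limit */

/* Pages too large for the buddy allocator count as gigantic. */
static int demo_is_gigantic(const struct demo_hstate *h)
{
	return h->order >= DEMO_MAX_ORDER;
}

/* Single entry point: callers no longer hardcode the buddy path. */
static enum demo_backend demo_alloc_fresh(const struct demo_hstate *h)
{
	return demo_is_gigantic(h) ? DEMO_CONTIG : DEMO_BUDDY;
}
```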
      
      [dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
        Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
      Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c397dae
    • Michal Hocko's avatar
      mm, hugetlb: get rid of surplus page accounting tricks · 9980d744
      Michal Hocko authored
      alloc_surplus_huge_page increases the pool size and the number of
      surplus pages opportunistically to prevent races with pool size
      changes.  See commit d1c3fb1f ("hugetlb: introduce
      nr_overcommit_hugepages sysctl") for more details.
      
      The resulting code is unnecessarily hairy, causes code duplication and
      doesn't allow sharing the allocation paths.  Moreover pool size changes
      tend to be very rare so optimizing for them is not really reasonable.
      Simplify the code and allow allocating a fresh surplus page as long as
      we are under the overcommit limit and then recheck the condition after
      the allocation and drop the new page if the situation has changed.  This
      should provide a reasonable guarantee that abrupt allocation requests
      will not go way off the limit.
      
      If we consider races with the pool shrinking and enlarging then we
      should be reasonably safe as well.  In the first case we are off by one
      in the worst case and the second case should work OK because the page is
      not yet visible.  We can waste CPU cycles for the allocation but that
      should be acceptable for a relatively rare condition.
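The check, allocate, recheck pattern can be modelled in a single-threaded sketch. Everything here is hypothetical (`demo_pool`, `demo_alloc_surplus`, the callback that stands in for a concurrent limit change); locking and the real page allocation are omitted.

```c
#include <stddef.h>

struct demo_pool {
	unsigned long surplus;		/* surplus pages currently accounted */
	unsigned long overcommit_limit;	/* maximum allowed surplus pages */
	unsigned long dropped;		/* pages allocated, then given back */
};

/* Returns 1 if a surplus page was kept, 0 if refused or dropped. */
static int demo_alloc_surplus(struct demo_pool *pool,
			      void (*during_alloc)(struct demo_pool *))
{
	/* First check: bail out early when already at the limit. */
	if (pool->surplus >= pool->overcommit_limit)
		return 0;

	/* The lock would be dropped here while the page is allocated;
	 * the callback models a concurrent change in that window. */
	if (during_alloc)
		during_alloc(pool);

	/* Recheck after the allocation and drop the page if the
	 * situation has changed in the meantime. */
	if (pool->surplus >= pool->overcommit_limit) {
		pool->dropped++;
		return 0;
	}
	pool->surplus++;
	return 1;
}

/* Models an administrator lowering the limit during the allocation. */
static void demo_shrink_limit(struct demo_pool *pool)
{
	pool->overcommit_limit = 0;
}
```

The cost of losing the race is one wasted allocation, which matches the trade-off the changelog accepts.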
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9980d744
    • Michal Hocko's avatar
      mm, hugetlb: do not rely on overcommit limit during migration · ab5ac90a
      Michal Hocko authored
      hugepage migration relies on __alloc_buddy_huge_page to get a new page.
      This has 2 main disadvantages.
      
      1) it doesn't allow migrating any huge page if the pool is used
         completely which is not an exceptional case as the pool is static and
         unused memory is just wasted.
      
      2) it leads to weird semantics when migration between two NUMA nodes
         might increase the pool size of the destination NUMA node while the
         page is in use.  The issue is caused by per NUMA node surplus pages
         tracking (see free_huge_page).
      
      Address both issues by changing the way we allocate and account
      pages allocated for migration.  Those should be temporary by definition.  So
      we mark them that way (we will abuse page flags in the 3rd page) and
      update free_huge_page to free such pages to the page allocator.  Page
      migration path then just transfers the temporary status from the new page
      to the old one which will be freed on the last reference.  The global
      surplus count will never change during this path but we still have to be
      careful when migrating a per-node surplus page.  This is now handled in
      move_hugetlb_state which is called from the migration path and it copies
      the hugetlb specific page state and fixes up the accounting when needed.
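The state transfer described above can be sketched with toy structures. This is a model of the idea only: `demo_move_state`, `demo_page` and `demo_hstate` are hypothetical, and the real move_hugetlb_state also deals with locking and other per-page state.

```c
#define DEMO_NR_NODES 2

struct demo_page {
	int nid;	/* NUMA node the page lives on */
	int temporary;	/* freed to the page allocator on last reference */
};

struct demo_hstate {
	unsigned long surplus_node[DEMO_NR_NODES];
};

/* After migration the new page becomes the long-lived one, so the
 * temporary marker moves to the old page, and a per-node surplus count,
 * if any, follows the data to the new node. */
static void demo_move_state(struct demo_hstate *h, struct demo_page *oldpage,
			    struct demo_page *newpage)
{
	if (!newpage->temporary)
		return;

	oldpage->temporary = 1;
	newpage->temporary = 0;

	if (h->surplus_node[oldpage->nid]) {
		h->surplus_node[oldpage->nid]--;
		h->surplus_node[newpage->nid]++;
	}
}
```

Note that the sum over all nodes is unchanged, which is the sense in which the global surplus count never changes on this path.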
      
      Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
      reflect its purpose.  The new allocation routine for the migration path
      is __alloc_migrate_huge_page.
      
      The user visible effect of this patch is that migrated pages are really
      temporary and they travel between NUMA nodes as per the migration
      request:
      
      Before migration
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      After
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      With the previous implementation, both nodes would have nr_hugepages:1
      until the page is freed.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab5ac90a
    • Michal Hocko's avatar
      mm, hugetlb: integrate giga hugetlb more naturally to the allocation path · d9cc948f
      Michal Hocko authored
      Gigantic hugetlb pages have grown into the hugetlb code as an alien
      species with a lot of special casing.  The allocation path is not an
      exception.  Unnecessarily so to be honest.  It is true that the
      underlying allocator is different but that is an implementation detail.
      
      This patch unifies the hugetlb allocation path that prepares fresh
      pool pages.  alloc_fresh_gigantic_page basically copies
      alloc_fresh_huge_page logic so we can move everything there.  This will
      simplify set_max_huge_pages which doesn't have to care about what kind
      of huge page we allocate.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9cc948f
    • Michal Hocko's avatar
      mm, hugetlb: unify core page allocation accounting and initialization · af0fb9df
      Michal Hocko authored
      Patch series "mm, hugetlb: allocation API and migration improvements"
      
      Motivation:
      
      This is a follow-up to [3] for the allocation API and [4] for the
      hugetlb migration.  It wasn't really easy to split those into two
      separate patch series as they share some code.
      
      My primary motivation to touch this code is to make gigantic page
      migration work.  The giga pages allocation code is just too fragile
      and hacked into the hugetlb code now.  This series tries to move giga
      pages closer to being first-class citizens.  We are not there yet but
      having 5 patches is quite a lot already and it will already make the
      code much easier to follow.  I will come up with other changes on top after
      this sees some review.
      
      The first two patches should be trivial to review.  The third patch
      changes the way we migrate huge pages.  Newly allocated pages are
      subject to the overcommit check and they participate in surplus accounting
      which is quite unfortunate as the changelog explains.  This patch
      doesn't change anything wrt. giga pages.
      
      Patch #4 removes the surplus accounting hack from
      __alloc_surplus_huge_page.  I hope I didn't miss anything there and a
      deeper review is really due there.
      
      Patch #5 finally unifies allocation paths and giga pages shouldn't be
      special anymore.  There is also some renaming going on.
      
      This patch (of 6):
      
      hugetlb allocator has two entry points to the page allocator
       - alloc_fresh_huge_page_node
       - __hugetlb_alloc_buddy_huge_page
      
      The two differ very subtly in two aspects.  The first one doesn't care
      about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
      prep_new_huge_page is not used because it not only initializes hugetlb
      specific stuff but also calls put_page, which releases the page to the
      hugetlb pool, and that is not what is required in some contexts.  This makes
      things more complicated than necessary.
      
      Simplify things by a) removing the duplicate page allocator entry point
      and keeping only __hugetlb_alloc_buddy_huge_page and b) making
      prep_new_huge_page more reusable by removing the put_page which moves
      the page to the allocator pool.  All current callers are updated to call
      put_page explicitly.  Later patches will add new callers which won't
      need it.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af0fb9df
    • Michal Hocko's avatar
      mm, hugetlb: remove hugepages_treat_as_movable sysctl · d6cb41cc
      Michal Hocko authored
      hugepages_treat_as_movable was introduced by 396faf03 ("Allow
      huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
      allocations from ZONE_MOVABLE even when hugetlb pages were not
      migratable.  The purpose of the movable zone was different at the time.
      It aimed at reducing memory fragmentation, and hugetlb pages, being long
      lived and large, were not contributing to the fragmentation so it was
      acceptable to use the zone back then.
      
      Things have changed though and the primary purpose of the zone became
      the migratability guarantee.  If we allow non-migratable hugetlb pages to
      be in ZONE_MOVABLE, memory hotplug might fail to offline the memory.
      
      Remove the knob and only rely on hugepage_migration_supported to allow
      movable zones.
      
      Mel said:
      
      : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
      : the ability to grow it again.  The use case was for batched jobs, some of
      : which needed huge pages and others that did not but didn't want the memory
      : useless pinned in the huge pages pool.
      :
      : I suspect that more users rely on THP than hugetlbfs for flexible use of
      : huge pages with fallback options so I think that removing the option
      : should be ok.
      
      Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6cb41cc
    • Roman Gushchin's avatar
      mm: show total hugetlb memory consumption in /proc/meminfo · fcb2b0c5
      Roman Gushchin authored
      Currently we display some hugepage statistics (total, free, etc) in
      /proc/meminfo, but only for default hugepage size (e.g.  2Mb).
      
      If hugepages of different sizes are used (like 2Mb and 1Gb on x86-64),
      /proc/meminfo output can be confusing, as non-default sized hugepages
      are not reflected at all, and there is no sign that they exist
      and consume system memory.
      
      To solve this problem, let's display the total amount of memory
      consumed by hugetlb pages of all sizes (both free and used).  Let's call
      it "Hugetlb", and display size in kB to match generic /proc/meminfo
      style.
      
      For example, (1024 2Mb pages and 2 1Gb pages are pre-allocated):
        $ cat /proc/meminfo
        MemTotal:        8168984 kB
        MemFree:         3789276 kB
        <...>
        CmaFree:               0 kB
        HugePages_Total:    1024
        HugePages_Free:     1024
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:         4194304 kB
        DirectMap4k:       32632 kB
        DirectMap2M:     4161536 kB
        DirectMap1G:     6291456 kB
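The Hugetlb figure in the example can be reproduced by summing pages times page size over every pool. A small sketch of that arithmetic, using a hypothetical `demo_hstate` record per page size:

```c
#include <stddef.h>

struct demo_hstate {
	unsigned long nr_pages;	/* pages in this pool, free and used */
	unsigned long page_kb;	/* page size in kB */
};

/* Total memory held by all hugetlb pools, in kB. */
static unsigned long demo_hugetlb_total_kb(const struct demo_hstate *h,
					   size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += h[i].nr_pages * h[i].page_kb;
	return total;
}
```

For the example, 1024 pages of 2048 kB plus 2 pages of 1048576 kB give 2097152 + 2097152 = 4194304 kB, whereas HugePages_Total * Hugepagesize alone would report only the first term.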
      
      Also, this patch updates the corresponding docs to reflect the meaning of
      the Hugetlb entry and the difference between Hugetlb and
      HugePages_Total * Hugepagesize.
      
      Link: http://lkml.kernel.org/r/20171115231409.12131-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcb2b0c5
  19. 30 Nov, 2017 2 commits
  20. 16 Nov, 2017 1 commit
    • Jérôme Glisse's avatar
      mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse authored
      This patch only affects users of mmu_notifier->invalidate_range callback
      which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
      and it is an optimization for those users.  Everyone else is unaffected
      by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (notify version of *_clear_flush helpers do call the
      mmu_notifier_invalidate_range).  But that notification is not necessary
      in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in-depth analysis of why this is fine:
      
      Consider a secondary TLB (non CPU TLB) such as an IOMMU TLB or a device
      TLB (when a device uses something like ATS/PASID to get the IOMMU to walk
      the CPU page table to access a process virtual address space).  There are
      only 2 cases in which you need to notify those secondary TLBs while
      holding the page table lock when clearing a pte/pmd:
      
        A) the page backing the address is freed before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious: you do not want to take the risk of the device writing
      to a page that might now be used by something completely different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before
      setting the new pte/pmd value, then you can break a memory model such as
      C11 or C++11 for the device.
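      
      The three-step sequence for case B can be sketched as a small userspace
      model.  This is only an illustration, not kernel code: struct toy_mm,
      toy_device_translate(), toy_invalidate_range() and toy_replace_page()
      are hypothetical names, the page table lock is a plain flag, and pages
      are represented by integers.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PAGE (-1)

/* Toy model: one page table entry and one secondary (device) TLB entry. */
struct toy_mm {
    int  pte;       /* page currently mapped by the CPU page table */
    int  dev_tlb;   /* page cached by the device TLB, or NO_PAGE */
    bool ptl_held;  /* stands in for the page table lock */
};

/* Device access: use the cached translation, else walk the page table. */
static int toy_device_translate(struct toy_mm *mm)
{
    if (mm->dev_tlb == NO_PAGE)
        mm->dev_tlb = mm->pte;
    return mm->dev_tlb;
}

/* Stand-in for the secondary TLB invalidation callback. */
static void toy_invalidate_range(struct toy_mm *mm)
{
    mm->dev_tlb = NO_PAGE;
}

/* Case B: make the entry point to a new page. */
static void toy_replace_page(struct toy_mm *mm, int new_page,
                             bool notify_under_lock)
{
    assert(!mm->ptl_held);
    mm->ptl_held = true;            /* take page table lock */

    mm->pte = NO_PAGE;              /* clear page table entry ... */
    if (notify_under_lock)
        toy_invalidate_range(mm);   /* ... and notify */

    mm->pte = new_page;             /* set entry to point to new page */

    mm->ptl_held = false;           /* release page table lock */
}
```

      In this model, when the notify happens under the lock the device's next
      translation sees the new page.  When it is skipped, the device keeps
      using the old page until a later invalidation arrives.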
      
      Consider the following scenario (the device uses a feature similar to
      ATS/PASID):
      
      Take two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE.
      We assume they are write protected for COW (the other cases of B apply
      too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+6] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here, because at time N+2 clearing the page table entry was not paired
      with a notification to invalidate the secondary TLB, the device sees the
      new value for addrB before seeing the new value for addrA.  This breaks
      total memory ordering for the device.
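      
      The timeline above can be replayed in a single-threaded userspace model
      to see which values the device ends up reading.  This is a sketch under
      stated assumptions, not kernel code: all names (struct toy_state,
      toy_run_timeline(), ...) are hypothetical, each page holds one int, the
      range_start step is assumed to have no effect on the device TLB, and
      the interleaving is fixed to the order shown in the scenario.

```c
#include <assert.h>
#include <stdbool.h>

enum { ADDR_A, ADDR_B, NR_ADDRS };
enum { OLD_A, OLD_B, NEW_A, NEW_B, NR_PAGES };
#define NO_PAGE (-1)

struct toy_state {
    int pte[NR_ADDRS];      /* CPU page table: address -> page */
    int dev_tlb[NR_ADDRS];  /* device TLB: address -> page, or NO_PAGE */
    int page[NR_PAGES];     /* content of each page (one int) */
};

struct toy_result { int dev_a; int dev_b; };

/* Device read: use the cached translation, else walk the page table. */
static int dev_read(struct toy_state *s, int addr)
{
    assert(addr >= 0 && addr < NR_ADDRS);
    if (s->dev_tlb[addr] == NO_PAGE)
        s->dev_tlb[addr] = s->pte[addr];
    return s->page[s->dev_tlb[addr]];
}

/* CPU write: always goes through the current page table entry. */
static void cpu_write(struct toy_state *s, int addr, int val)
{
    s->page[s->pte[addr]] = val;
}

/* COW: copy the old content, clear the entry, optionally notify, remap. */
static void cow_update(struct toy_state *s, int addr, int new_page,
                       bool notify_under_lock)
{
    s->page[new_page] = s->page[s->pte[addr]];
    s->pte[addr] = NO_PAGE;
    if (notify_under_lock)
        s->dev_tlb[addr] = NO_PAGE;
    s->pte[addr] = new_page;
}

static struct toy_result toy_run_timeline(bool notify_under_lock)
{
    struct toy_state s = {
        .pte     = { OLD_A, OLD_B },
        .dev_tlb = { NO_PAGE, NO_PAGE },
        .page    = { 0, 0, 0, 0 },
    };
    struct toy_result r;

    /* Device reads addrA and addrB and populates its TLB. */
    (void)dev_read(&s, ADDR_A);
    (void)dev_read(&s, ADDR_B);

    /* CPU-thread-0 and CPU-thread-1 update the page table (COW). */
    cow_update(&s, ADDR_A, NEW_A, notify_under_lock);
    cow_update(&s, ADDR_B, NEW_B, notify_under_lock);

    /* CPU-thread-2 writes addrA, then CPU-thread-3 writes addrB. */
    cpu_write(&s, ADDR_A, 1);
    cpu_write(&s, ADDR_B, 1);

    /* Only CPU-thread-1 reaches range_end, which invalidates addrB. */
    s.dev_tlb[ADDR_B] = NO_PAGE;

    /* Device reads both addresses again. */
    r.dev_a = dev_read(&s, ADDR_A);
    r.dev_b = dev_read(&s, ADDR_B);
    return r;
}
```

      In this model, toy_run_timeline(false) leaves the device with the old
      value for addrA and the new value for addrB, although addrA was written
      first.  toy_run_timeline(true) gives the new value for both.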
      
      When changing a pte to write protect it, or to point it to a new write
      protected page with the same content (KSM), it is ok to delay the
      invalidate_range callback to mmu_notifier_invalidate_range_end() outside
      the page table lock.  This is true even if the thread doing the page
      table update is preempted right after releasing the page table lock and
      before calling mmu_notifier_invalidate_range_end().
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f10851e