1. 31 Oct, 2018 3 commits
    • mm: remove include/linux/bootmem.h · 57c8a661
      Mike Rapoport authored
      Move remaining definitions and declarations from include/linux/bootmem.h
      into include/linux/memblock.h and remove the redundant header.
      
      The includes were replaced with the semantic patch below and then
      semi-automated removal of duplicated '#include <linux/memblock.h>'.
      
      @@
      @@
      - #include <linux/bootmem.h>
      + #include <linux/memblock.h>
      
      [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
      [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
      [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
        Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants · 97ad1087
      Mike Rapoport authored
      Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of
      identical MEMBLOCK definitions.
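
      For illustration, a typical boot-time call site ends up passing the
      MEMBLOCK constant as the address limit (illustrative call, assuming the
      allocator rename from the related patch in this series):

        ptr = memblock_alloc_try_nid(size, align, min_addr,
                                     MEMBLOCK_ALLOC_ACCESSIBLE, nid);
        /* the limit was previously spelled BOOTMEM_ALLOC_ACCESSIBLE */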
      
      Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: remove _virt from APIs returning virtual address · eb31d559
      Mike Rapoport authored
      The conversion is done using
      
      sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
      	$(git grep -l memblock_virt_alloc)
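
      For a single hypothetical call site, the conversion therefore looks
      like this:

        /* before */
        table = memblock_virt_alloc(size, SMP_CACHE_BYTES);
        /* after */
        table = memblock_alloc(size, SMP_CACHE_BYTES);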
      
      Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 26 Oct, 2018 1 commit
    • hugetlbfs: dirty pages as they are added to pagecache · 22146c3c
      Mike Kravetz authored
      Some test systems were experiencing negative huge page reserve counts and
      incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
      removing clean pages from hugetlbfs file pagecaches.  When code outside
      hugetlbfs explicitly removes those pages, the appropriate accounting is
      not performed.
      
      This can be recreated as follows:
       fallocate -l 2M /dev/hugepages/foo
       echo 1 > /proc/sys/vm/drop_caches
       fallocate -l 2M /dev/hugepages/foo
       grep -i huge /proc/meminfo
         AnonHugePages:         0 kB
         ShmemHugePages:        0 kB
         HugePages_Total:    2048
         HugePages_Free:     2047
         HugePages_Rsvd:    18446744073709551615
         HugePages_Surp:        0
         Hugepagesize:       2048 kB
         Hugetlb:         4194304 kB
       ls -lsh /dev/hugepages/foo
         4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
      
      To address this issue, dirty pages as they are added to the pagecache.
      The problem can easily be reproduced with fallocate as shown above.
      Read-faulted pages will eventually end up being marked dirty, but there
      is a window where they are clean and can be impacted by code such as
      drop_caches.  So, just dirty them all as they are added to the
      pagecache.
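
      A minimal sketch of the approach (the real change is in hugetlbfs'
      huge_add_to_page_cache(); error handling and locking omitted):

        err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
        if (!err)
                /* keep drop_caches and friends from tossing the page */
                set_page_dirty(page);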
      
      Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
      Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 05 Oct, 2018 2 commits
  4. 24 Aug, 2018 2 commits
    • mm: Change return type int to vm_fault_t for fault handlers · 2b740303
      Souptick Joarder authored
      Use new return type vm_fault_t for fault handler.  For now, this is just
      documenting that the function returns a VM_FAULT value rather than an
      errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      The aim is to change the return type of finish_fault() and
      handle_mm_fault() to vm_fault_t.  As part of that cleanup, the return
      types of all other recursively called functions have been changed to
      vm_fault_t as well.
      
      The places from which handle_mm_fault() is invoked will be changed to
      vm_fault_t in a separate patch.
      
      vmf_error() is a new inline function introduced in 4.17-rc6.
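
      As an illustration, a converted handler only changes its declared
      return type (the handler name below is made up):

        /* before: VM_FAULT_* codes travel in a plain int */
        static int my_fault(struct vm_fault *vmf);

        /* after: the VM_FAULT_* nature of the value is visible in the type */
        static vm_fault_t my_fault(struct vm_fault *vmf);

        static const struct vm_operations_struct my_vm_ops = {
                .fault = my_fault,
        };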
      
      [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
      Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix race on soft-offlining free huge pages · 6bc9b564
      Naoya Horiguchi authored
      Patch series "mm: soft-offline: fix race against page allocation".
      
      Xishi recently reported an issue with races on reusing the target pages
      of soft offlining.  Discussion and analysis showed that we need to make
      sure that setting PG_hwpoison is done in the right place under
      zone->lock for soft offline.  Patch 1/2 handles the free hugepage case,
      and patch 2/2 handles the free buddy page case.
      
      This patch (of 2):
      
      There's a race condition between soft offline and hugetlb_fault which
      causes unexpected process killing and/or hugetlb allocation failure.
      
      The process killing is caused by the following flow:
      
        CPU 0               CPU 1              CPU 2
      
        soft offline
          get_any_page
          // find the hugetlb is free
                            mmap a hugetlb file
                            page fault
                              ...
                                hugetlb_fault
                                  hugetlb_no_page
                                    alloc_huge_page
                                    // succeed
            soft_offline_free_page
            // set hwpoison flag
                                               mmap the hugetlb file
                                               page fault
                                                 ...
                                                   hugetlb_fault
                                                     hugetlb_no_page
                                                       find_lock_page
                                                         return VM_FAULT_HWPOISON
                                                 mm_fault_error
                                                   do_sigbus
                                                   // kill the process
      
      The hugetlb allocation failure comes from the following flow:
      
        CPU 0                          CPU 1
      
                                       mmap a hugetlb file
                                       // reserve all free page but don't fault-in
        soft offline
          get_any_page
          // find the hugetlb is free
            soft_offline_free_page
            // set hwpoison flag
              dissolve_free_huge_page
              // fail because all free hugepages are reserved
                                       page fault
                                         ...
                                           hugetlb_fault
                                             hugetlb_no_page
                                               alloc_huge_page
                                                 ...
                                                   dequeue_huge_page_node_exact
                                                   // ignore hwpoisoned hugepage
                                                   // and finally fail due to no-mem
      
      The root cause is that the current soft-offline code is written based
      on the assumption that the PageHWPoison flag should be set first to
      avoid accessing the corrupted data.  This makes sense for
      memory_failure() or hard offline, but not for soft offline, because
      soft offline deals with corrected (not uncorrected) errors and is safe
      from data loss.  This patch changes the soft offline semantics so that
      the PageHWPoison flag is set only after containment of the error page
      completes successfully.
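
      In rough pseudocode the new ordering for a free hugepage is (helper
      names are real, the surrounding logic is heavily simplified):

        if (!dissolve_free_huge_page(page))     /* contain the page first */
                SetPageHWPoison(page);          /* flag it only on success */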
      
      Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zy.zhengyi@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 17 Aug, 2018 4 commits
  6. 02 Aug, 2018 1 commit
    • ipc/shm.c add ->pagesize function to shm_vm_ops · eec3636a
      Jane Chu authored
      Commit 05ea8860 ("mm, hugetlbfs: introduce ->pagesize() to
      vm_operations_struct") adds a new ->pagesize() function to
      hugetlb_vm_ops, intended to cover all hugetlbfs backed files.
      
      In the System V shared memory model, if "huge page" is specified, the
      "shared memory" is backed by hugetlbfs files, but the mappings initiated
      via shmget/shmat have their original vm_ops overwritten with shm_vm_ops,
      so we need to add a ->pagesize function to shm_vm_ops.  Otherwise,
      vma_kernel_pagesize() returns PAGE_SIZE for a hugetlbfs-backed vma,
      resulting in the BUG below:
      
        fs/hugetlbfs/inode.c
              443             if (unlikely(page_mapped(page))) {
              444                     BUG_ON(truncate_op);
      
      resulting in
      
        hugetlbfs: oracle (4592): Using mlock ulimits for SHM_HUGETLB is deprecated
        ------------[ cut here ]------------
        kernel BUG at fs/hugetlbfs/inode.c:444!
        Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 ...
        CPU: 35 PID: 5583 Comm: oracle_5583_sbt Not tainted 4.14.35-1829.el7uek.x86_64 #2
        RIP: 0010:remove_inode_hugepages+0x3db/0x3e2
        ....
        Call Trace:
          hugetlbfs_evict_inode+0x1e/0x3e
          evict+0xdb/0x1af
          iput+0x1a2/0x1f7
          dentry_unlink_inode+0xc6/0xf0
          __dentry_kill+0xd8/0x18d
          dput+0x1b5/0x1ed
          __fput+0x18b/0x216
          ____fput+0xe/0x10
          task_work_run+0x90/0xa7
          exit_to_usermode_loop+0xdd/0x116
          do_syscall_64+0x187/0x1ae
          entry_SYSCALL_64_after_hwframe+0x150/0x0
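
      The fix adds a ->pagesize() handler to shm_vm_ops that forwards to the
      underlying (hugetlbfs) vm_ops; a sketch reconstructed from the
      description above:

        static unsigned long shm_pagesize(struct vm_area_struct *vma)
        {
                struct file *file = vma->vm_file;
                struct shm_file_data *sfd = shm_file_data(file);

                if (sfd->vm_ops->pagesize)
                        return sfd->vm_ops->pagesize(vma);

                return PAGE_SIZE;
        }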
      
      [jane.chu@oracle.com: relocate comment]
        Link: http://lkml.kernel.org/r/20180731044831.26036-1-jane.chu@oracle.com
      Link: http://lkml.kernel.org/r/20180727211727.5020-1-jane.chu@oracle.com
      Fixes: 05ea8860 ("mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct")
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 04 Jul, 2018 1 commit
  8. 12 Jun, 2018 1 commit
    • treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook authored
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
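
      As a small illustration of the 2-factor transformation on a
      hypothetical call site:

        /* before: the multiplication can overflow silently */
        buf = kmalloc(count * sizeof(*buf), GFP_KERNEL);

        /* after: kmalloc_array() fails cleanly on overflow */
        buf = kmalloc_array(count, sizeof(*buf), GFP_KERNEL);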
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  9. 08 Jun, 2018 2 commits
    • mm, hugetlbfs: pass fault address to no page handler · 285b8dca
      Huang Ying authored
      This is to take better advantage of general huge page clearing
      optimization (commit c79b57e4: "mm: hugetlb: clear target sub-page
      last when clearing huge page") for hugetlbfs.
      
      In the general optimization patch, the sub-page to be accessed is
      cleared last so that its cache lines are not evicted while the other
      sub-pages are cleared.  This works better if we have the address of the
      sub-page to be accessed, that is, the fault address inside the huge
      page.  So the hugetlbfs no-page fault handler is changed to pass that
      information along.  This benefits workloads that do not access the
      beginning of the hugetlbfs huge page after the page fault, under heavy
      cache contention for the shared last level cache.
      
      The patch is a generic optimization that should benefit quite a few
      workloads rather than a specific use case.  To demonstrate the
      performance benefit of the patch, we tested it with vm-scalability run
      on hugetlbfs.
      
      With this patch, the throughput increases ~28.1% in vm-scalability
      anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
      system (44 cores, 88 threads).  The test case creates 88 processes, each
      process mmaps a big anonymous memory area with MAP_HUGETLB and writes to
      it from the end to the beginning.  For each process, the other
      processes can be seen as background workload that generates heavy cache
      pressure.  At the same time, the cache miss rate dropped from ~36.3% to
      ~25.6%, the IPC (instructions per cycle) increased from 0.3 to 0.37, and
      the time spent in user space was reduced by ~19.3%.
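
      In code terms, the no-page handler can now clear the huge page with the
      fault address as the hint (sketch; the actual call site may differ):

        /* clear the faulted sub-page last so its cache lines survive */
        clear_huge_page(page, address, pages_per_huge_page(h));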
      
      Link: http://lkml.kernel.org/r/20180517083539.9242-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change return type to vm_fault_t · b3ec9f33
      Souptick Joarder authored
      Use new return type vm_fault_t for fault handler in struct
      vm_operations_struct.  For now, this is just documenting that the
      function returns a VM_FAULT value rather than an errno.  Once all
      instances are converted, vm_fault_t will become a distinct type.
      
      See commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 16 Apr, 2018 1 commit
  11. 06 Apr, 2018 2 commits
  12. 23 Mar, 2018 1 commit
  13. 10 Mar, 2018 1 commit
  14. 01 Feb, 2018 9 commits
    • hugetlb, mbind: fall back to default policy if vma is NULL · 389c8178
      Michal Hocko authored
      Dan Carpenter has noticed that the mbind migration callback (new_page)
      can get a NULL vma pointer and choke on it inside alloc_huge_page_vma,
      which relies on the VMA to get the hstate.  We used to BUG_ON this
      case, but the BUG_ON has been removed recently by "hugetlb, mempolicy:
      fix the mbind hugetlb migration".
      
      The proper way to handle this is to get the hstate from the migrated
      page and rely on huge_node (resp.  get_vma_policy) to do the right
      thing with a NULL VMA.  We currently fall back to the default mempolicy
      in that case, which is in line with what the THP path does here.
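
      The hugetlb branch of the migration callback then looks roughly like
      this (sketch of the new_page() change; exact code may differ):

        if (PageHuge(page))
                return alloc_huge_page_vma(page_hstate(compound_head(page)),
                                           vma, address);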
      
      Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb, mempolicy: fix the mbind hugetlb migration · ebd63723
      Michal Hocko authored
      The do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
      pages.  alloc_huge_page_noerr uses alloc_huge_page, which is a
      high-level allocation function that has to take care of reserves,
      overcommit and hugetlb cgroup accounting.  None of that is really
      required for page migration, because the new page is only temporary and
      will either replace the original page or be dropped.  This is
      essentially the same as for other migration call paths, so there
      shouldn't be any reason to handle mbind in a special way.
      
      The current implementation is even suboptimal because the migration
      might fail just because the hugetlb cgroup limit is reached, or the
      overcommit is saturated.
      
      Fix this by making mbind like other hugetlb migration paths.  Add a new
      migration helper alloc_huge_page_vma as a wrapper around
      alloc_huge_page_nodemask with additional mempolicy handling.
      
      alloc_huge_page_noerr has no more users and it can go.
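
      A sketch of what such a wrapper can look like (simplified; as the fix
      described above notes, the hstate lookup was later moved out of the
      vma):

        struct page *alloc_huge_page_vma(struct vm_area_struct *vma,
                                         unsigned long address)
        {
                struct hstate *h = hstate_vma(vma);
                struct mempolicy *mpol;
                nodemask_t *nodemask;
                struct page *page;
                gfp_t gfp_mask;
                int node;

                gfp_mask = htlb_alloc_mask(h);
                node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
                page = alloc_huge_page_nodemask(h, node, nodemask);
                mpol_cond_put(mpol);

                return page;
        }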
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: further simplify hugetlb allocation API · 0c397dae
      Michal Hocko authored
      The hugetlb allocator has several layers of allocation functions
      depending on the purpose of the allocation.  There are two allocators,
      depending on whether the page can be allocated from the page allocator
      or whether we need the contiguous allocator.  This is currently
      open-coded in alloc_fresh_huge_page, which is the only path that might
      allocate giga pages, which require the latter allocator.  Create
      alloc_fresh_huge_page which hides this implementation detail and use it
      in all callers which hardcoded the buddy allocator path
      (__hugetlb_alloc_buddy_huge_page).  This shouldn't introduce any
      functional change because both the migration and surplus allocators
      exclude giga pages explicitly.
      
      While we are at it, let's do some renaming.  The current scheme is not
      consistent and is overly painful to read and understand.  Get rid of
      the prefix underscores from most functions.  There is no real reason to
      make names longer.
      
      * alloc_fresh_huge_page is the new layer to abstract underlying
        allocator
      * __hugetlb_alloc_buddy_huge_page becomes shorter and neater
        alloc_buddy_huge_page.
      * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
        the new page directly to the pool
      * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
        as it uses alloc_fresh_huge_page now
      * others lose their excessive prefix underscores to make names shorter
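
      A hedged sketch of the new alloc_fresh_huge_page layer described above
      (simplified; the real function also deals with node rotation and error
      paths):

        static struct page *alloc_fresh_huge_page(struct hstate *h,
                                                  gfp_t gfp_mask, int nid,
                                                  nodemask_t *nmask)
        {
                struct page *page;

                if (hstate_is_gigantic(h))
                        page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
                else
                        page = alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
                if (!page)
                        return NULL;

                if (hstate_is_gigantic(h))
                        prep_compound_gigantic_page(page, huge_page_order(h));
                prep_new_huge_page(h, page, page_to_nid(page));

                return page;
        }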
      
      [dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
        Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
      Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: get rid of surplus page accounting tricks · 9980d744
      Michal Hocko authored
      alloc_surplus_huge_page increases the pool size and the number of
      surplus pages opportunistically to prevent races with pool size
      changes.  See commit d1c3fb1f ("hugetlb: introduce
      nr_overcommit_hugepages sysctl") for more details.
      
      The resulting code is unnecessarily hairy, causes code duplication and
      doesn't allow sharing the allocation paths.  Moreover, pool size
      changes tend to be very rare, so optimizing for them is not really
      reasonable.  Simplify the code: allocate a fresh surplus page as long
      as we are under the overcommit limit, then recheck the condition after
      the allocation and drop the new page if the situation has changed.
      This should provide a reasonable guarantee that abrupt allocation
      requests will not go far over the limit.
      
      If we consider races with the pool shrinking and enlarging then we
      should be reasonably safe as well.  In the first case we are off by one
      in the worst case and the second case should work OK because the page is
      not yet visible.  We can waste CPU cycles for the allocation but that
      should be acceptable for a relatively rare condition.
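
      As pseudocode, the simplified flow is roughly (the hstate fields and
      hugetlb_lock are real, the rest is condensed):

        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
                return NULL;

        page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);

        spin_lock(&hugetlb_lock);
        if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
                /* someone else used up the overcommit quota meanwhile */
                put_page(page);
                page = NULL;
        } else {
                h->surplus_huge_pages++;
                h->surplus_huge_pages_node[page_to_nid(page)]++;
        }
        spin_unlock(&hugetlb_lock);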
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: do not rely on overcommit limit during migration · ab5ac90a
      Michal Hocko authored
      hugepage migration relies on __alloc_buddy_huge_page to get a new page.
      This has 2 main disadvantages.
      
      1) it doesn't allow migrating any huge page if the pool is used
         completely, which is not an exceptional case as the pool is static
         and unused memory is just wasted.
      
      2) it leads to a weird semantic when migration between two NUMA nodes
         might increase the pool size of the destination NUMA node while the
         page is in use.  The issue is caused by per NUMA node surplus page
         tracking (see free_huge_page).
      
      Address both issues by changing the way we allocate and account pages
      allocated for migration.  Those should be temporary by definition.  So
      we mark them that way (we will abuse page flags in the 3rd page) and
      update free_huge_page to free such pages to the page allocator.  The
      page migration path then just transfers the temporary status from the
      new page to the old one, which will be freed on the last reference.
      The global surplus count will never change during this path, but we
      still have to be careful when migrating a per-node surplus page.  This
      is now handled in move_hugetlb_state, which is called from the
      migration path; it copies the hugetlb-specific page state and fixes up
      the accounting when needed.
      
      Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
      reflect its purpose.  The new allocation routine for the migration path
      is __alloc_migrate_huge_page.
      
      The user visible effect of this patch is that migrated pages are really
      temporal and they travel between NUMA nodes as per the migration
      request:
      
      Before migration
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      After
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      with the previous implementation, both nodes would have nr_hugepages:1
      until the page is freed.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: integrate giga hugetlb more naturally to the allocation path · d9cc948f
      Michal Hocko authored
      Gigantic hugetlb pages were grown into the hugetlb code as an alien
      species with a lot of special casing.  The allocation path is not an
      exception.  Unnecessarily so, to be honest.  It is true that the
      underlying allocator is different, but that is an implementation
      detail.
      
      This patch unifies the hugetlb allocation path that prepares fresh pool
      pages.  alloc_fresh_gigantic_page basically copies the
      alloc_fresh_huge_page logic, so we can move everything there.  This
      will simplify set_max_huge_pages, which doesn't have to care about what
      kind of huge page we allocate.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: unify core page allocation accounting and initialization · af0fb9df
      Michal Hocko authored
      Patch series "mm, hugetlb: allocation API and migration improvements"
      
      Motivation:
      
      this is a follow up for [3] for the allocation API and [4] for the
      hugetlb migration.  It wasn't really easy to split those into two
      separate patch series as they share some code.
      
      My primary motivation for touching this code is to make gigantic page
      migration work.  The giga page allocation code is just too fragile and
      hacked into the hugetlb code right now.  This series tries to move giga
      pages closer to being a first class citizen.  We are not there yet, but
      having 5 patches is quite a lot already and it will already make the
      code much easier to follow.  I will come with other changes on top
      after this sees some review.
      
      The first two patches should be trivial to review.  The third patch
      changes the way we migrate huge pages.  Newly allocated pages are
      subject to the overcommit check and they participate in surplus
      accounting, which is quite unfortunate, as the changelog explains.
      This patch doesn't change anything wrt.  giga pages.
      
      Patch #4 removes the surplus accounting hack from
      __alloc_surplus_huge_page.  I hope I didn't miss anything there and a
      deeper review is really due there.
      
      Patch #5 finally unifies allocation paths and giga pages shouldn't be
      any special anymore.  There is also some renaming going on as well.
      
      This patch (of 6):
      
      hugetlb allocator has two entry points to the page allocator
       - alloc_fresh_huge_page_node
       - __hugetlb_alloc_buddy_huge_page
      
      The two differ very subtly in two aspects.  The first one doesn't care
      about the HTLB_BUDDY_* stats and it doesn't initialize the huge page.
      prep_new_huge_page is not used because it not only initializes
      hugetlb-specific state but also calls put_page, which releases the page
      to the hugetlb pool, and that is not what is required in some contexts.
      This makes things more complicated than necessary.
      
      Simplify things by a) removing the page allocator entry point
      duplication and keeping only __hugetlb_alloc_buddy_huge_page, and b)
      making prep_new_huge_page more reusable by removing the put_page which
      moves the page to the allocator pool.  All current callers are updated
      to call put_page explicitly.  Later patches will add new callers which
      won't need it.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: remove hugepages_treat_as_movable sysctl · d6cb41cc
      Michal Hocko authored
      hugepages_treat_as_movable was introduced by 396faf03 ("Allow huge
      page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb allocations
      from ZONE_MOVABLE even when hugetlb pages were not migratable.  The
      purpose of the movable zone was different at the time: it aimed at
      reducing memory fragmentation, and hugetlb pages, being long lived and
      large, were not contributing to the fragmentation, so it was acceptable
      to use the zone back then.
      
      Things have changed though, and the primary purpose of the zone became
      the migratability guarantee.  If we allow non-migratable hugetlb pages
      in ZONE_MOVABLE, memory hotplug might fail to offline the memory.
      
      Remove the knob and rely only on hugepage_migration_supported to allow
      movable zones.
      
      Mel said:
      
      : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
      : the ability to grow it again.  The use case was for batched jobs, some of
      : which needed huge pages and others that did not but didn't want the memory
      : uselessly pinned in the huge pages pool.
      :
      : I suspect that more users rely on THP than hugetlbfs for flexible use of
      : huge pages with fallback options so I think that removing the option
      : should be ok.
      
      Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: show total hugetlb memory consumption in /proc/meminfo · fcb2b0c5
      Roman Gushchin authored
      Currently we display some hugepage statistics (total, free, etc) in
      /proc/meminfo, but only for default hugepage size (e.g.  2Mb).
      
      If hugepages of different sizes are used (like 2Mb and 1Gb on x86-64),
      /proc/meminfo output can be confusing, as non-default sized hugepages
      are not reflected at all, and there is no sign that they exist and
      consume system memory.
      
      To solve this problem, let's display the total amount of memory
      consumed by hugetlb pages of all sizes (both free and used).  Let's
      call it "Hugetlb" and display the size in kB to match the generic
      /proc/meminfo style.
      
      For example, (1024 2Mb pages and 2 1Gb pages are pre-allocated):
        $ cat /proc/meminfo
        MemTotal:        8168984 kB
        MemFree:         3789276 kB
        <...>
        CmaFree:               0 kB
        HugePages_Total:    1024
        HugePages_Free:     1024
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:         4194304 kB
        DirectMap4k:       32632 kB
        DirectMap2M:     4161536 kB
        DirectMap1G:     6291456 kB
      
      Also, this patch updates the corresponding docs to explain the meaning
      of the Hugetlb entry and the difference between Hugetlb and
      HugePages_Total * Hugepagesize.
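
      The reported value is essentially a sum over all hugepage sizes
      (sketch; the real code lives in hugetlb_report_meminfo()):

        unsigned long pages = 0;
        struct hstate *h;

        for_each_hstate(h)
                pages += h->nr_huge_pages * pages_per_huge_page(h);

        /* printed on the "Hugetlb:" line, converted from pages to kB */
        hugetlb_kb = pages << (PAGE_SHIFT - 10);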
      
      Link: http://lkml.kernel.org/r/20171115231409.12131-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 30 Nov, 2017 2 commits
  16. 16 Nov, 2017 1 commit
    • mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse authored
      This patch only affects users of mmu_notifier->invalidate_range callback
      which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
      and it is an optimization for those users.  Everyone else is unaffected
      by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (notify version of *_clear_flush helpers do call the
      mmu_notifier_invalidate_range).  But that notification is not necessary
      in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in-depth analysis of why it is fine to do this:
      
      Consider secondary TLBs (non-CPU TLBs) like the IOMMU TLB or a device
      TLB (where the device uses something like ATS/PASID to get the IOMMU to
      walk the CPU page table to access a process virtual address space).
      There are only 2 cases when you need to notify those secondary TLBs
      while holding the page table lock when clearing a pte/pmd:
      
        A) the page backing the address is freed before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious: you do not want to take the risk of the device
      writing to a page that might now be used by something completely
      different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before
      setting the new pte/pmd value, then you can break the memory model
      (e.g. C11 or C++11) for the device.
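
      In kernel terms, case B is the pattern used, for example, by the COW
      path (sketch; the *_notify helpers are real, the surrounding code is
      elided):

        entry = mk_pte(new_page, vma->vm_page_prot);
        ptep_clear_flush_notify(vma, address, ptep); /* clear + invalidate_range */
        set_pte_at_notify(mm, address, ptep, entry); /* then install the new page */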
      
      Consider the following scenario (the device uses a feature similar to
      ATS/PASID):
      
      Two addresses, addrA and addrB, such that |addrA - addrB| >= PAGE_SIZE;
      we assume they are write protected for COW (the other cases of B apply
      too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here, because at time N+2 the page table entry clear was not paired
      with a notification to invalidate the secondary TLB, the device sees
      the new value for addrB before seeing the new value for addrA.  This
      breaks total memory ordering for the device.
      
      When changing a pte to write protect, or to point to a new write
      protected page with the same content (KSM), it is OK to delay the
      invalidate_range callback to mmu_notifier_invalidate_range_end()
      outside the page table lock.  This is true even if the thread doing the
      page table update is preempted right after releasing the page table
      lock, before calling mmu_notifier_invalidate_range_end.
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 03 Nov, 2017 1 commit
  18. 07 Sep, 2017 3 commits
  19. 15 Aug, 2017 1 commit
    • mm/hugetlb: Allow arch to override and call the weak function · e24a1307
      Aneesh Kumar K.V authored
      When running in guest mode, ppc64 supports a different mechanism for
      hugetlb allocation/reservation.  The LPAR management application called
      HMC can be used to reserve a set of hugepages, and we pass the details
      of the reserved pages to the guest via the device tree (more details in
      htab_dt_scan_hugepage_blocks()).  We do the memblock_reserve of the
      range and, later in the boot sequence, add the reserved range to
      huge_boot_pages.
      
      But to enable 16G hugetlb on a bare-metal config (when we are not
      running as a guest) we want to do the memblock reservation during boot.
      Generic code already does this.
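
      One way to express this is to make the generic helper weak so the
      architecture can override it (sketch; the actual patch may use a weak
      alias rather than a wrapper):

        int __init __weak alloc_bootmem_huge_page(struct hstate *h)
        {
                return __alloc_bootmem_huge_page(h);
        }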
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  20. 10 Aug, 2017 1 commit