1. 02 Aug, 2018 1 commit
    • Jane Chu's avatar
      ipc/shm.c add ->pagesize function to shm_vm_ops · eec3636a
      Jane Chu authored
      Commit 05ea8860 ("mm, hugetlbfs: introduce ->pagesize() to
      vm_operations_struct") adds a new ->pagesize() function to
      hugetlb_vm_ops, intended to cover all hugetlbfs backed files.
      
      With System V shared memory model, if "huge page" is specified, the
      "shared memory" is backed by hugetlbfs files, but the mappings initiated
      via shmget/shmat have their original vm_ops overwritten with shm_vm_ops,
      so we need to add a ->pagesize function to shm_vm_ops.  Otherwise,
      vma_kernel_pagesize() returns PAGE_SIZE given a hugetlbfs backed vma,
      result in below BUG:
      
        fs/hugetlbfs/inode.c
              443             if (unlikely(page_mapped(page))) {
              444                     BUG_ON(truncate_op);
      
      resulting in
      
        hugetlbfs: oracle (4592): Using mlock ulimits for SHM_HUGETLB is deprecated
        ------------[ cut here ]------------
        kernel BUG at fs/hugetlbfs/inode.c:444!
        Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 ...
        CPU: 35 PID: 5583 Comm: oracle_5583_sbt Not tainted 4.14.35-1829.el7uek.x86_64 #2
        RIP: 0010:remove_inode_hugepages+0x3db/0x3e2
        ....
        Call Trace:
          hugetlbfs_evict_inode+0x1e/0x3e
          evict+0xdb/0x1af
          iput+0x1a2/0x1f7
          dentry_unlink_inode+0xc6/0xf0
          __dentry_kill+0xd8/0x18d
          dput+0x1b5/0x1ed
          __fput+0x18b/0x216
          ____fput+0xe/0x10
          task_work_run+0x90/0xa7
          exit_to_usermode_loop+0xdd/0x116
          do_syscall_64+0x187/0x1ae
          entry_SYSCALL_64_after_hwframe+0x150/0x0
      
      [jane.chu@oracle.com: relocate comment]
        Link: http://lkml.kernel.org/r/20180731044831.26036-1-jane.chu@oracle.com
      Link: http://lkml.kernel.org/r/20180727211727.5020-1-jane.chu@oracle.com
      Fixes: 05ea8860 ("mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct")
      Signed-off-by: 's avatarJane Chu <jane.chu@oracle.com>
      Suggested-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: 's avatarDavidlohr Bueso <dave@stgolabs.net>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      eec3636a
  2. 04 Jul, 2018 1 commit
  3. 12 Jun, 2018 1 commit
    • Kees Cook's avatar
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook authored
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: 's avatarKees Cook <keescook@chromium.org>
      6da2ec56
  4. 08 Jun, 2018 2 commits
    • Huang Ying's avatar
      mm, hugetlbfs: pass fault address to no page handler · 285b8dca
      Huang Ying authored
      This is to take better advantage of general huge page clearing
      optimization (commit c79b57e4: "mm: hugetlb: clear target sub-page
      last when clearing huge page") for hugetlbfs.
      
      In the general optimization patch, the sub-page to access will be
      cleared last to avoid the cache lines of to access sub-page to be
      evicted when clearing other sub-pages.  This works better if we have the
      address of the sub-page to access, that is, the fault address inside the
      huge page.  So the hugetlbfs no page fault handler is changed to pass
      that information.  This will benefit workloads which don't access the
      begin of the hugetlbfs huge page after the page fault under heavy cache
      contention for shared last level cache.
      
      The patch is a generic optimization which should benefit quite some
      workloads, not for a specific use case.  To demonstrate the performance
      benefit of the patch, we tested it with vm-scalability run on hugetlbfs.
      
      With this patch, the throughput increases ~28.1% in vm-scalability
      anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
      system (44 cores, 88 threads).  The test case creates 88 processes, each
      process mmaps a big anonymous memory area with MAP_HUGETLB and writes to
      it from the end to the begin.  For each process, other processes could
      be seen as other workload which generates heavy cache pressure.  At the
      same time, the cache miss rate reduced from ~36.3% to ~25.6%, the IPC
      (instruction per cycle) increased from 0.3 to 0.37, and the time spent
      in user space is reduced ~19.3%.
      
      Link: http://lkml.kernel.org/r/20180517083539.9242-1-ying.huang@intel.comSigned-off-by: 's avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      285b8dca
    • Souptick Joarder's avatar
      mm: change return type to vm_fault_t · b3ec9f33
      Souptick Joarder authored
      Use new return type vm_fault_t for fault handler in struct
      vm_operations_struct.  For now, this is just documenting that the
      function returns a VM_FAULT value rather than an errno.  Once all
      instances are converted, vm_fault_t will become a distinct type.
      
      See commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PCSigned-off-by: 's avatarSouptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: 's avatarMatthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3ec9f33
  5. 16 Apr, 2018 1 commit
  6. 06 Apr, 2018 2 commits
  7. 23 Mar, 2018 1 commit
  8. 10 Mar, 2018 1 commit
  9. 01 Feb, 2018 9 commits
    • Michal Hocko's avatar
      hugetlb, mbind: fall back to default policy if vma is NULL · 389c8178
      Michal Hocko authored
      Dan Carpenter has noticed that mbind migration callback (new_page) can
      get a NULL vma pointer and choke on it inside alloc_huge_page_vma which
      relies on the VMA to get the hstate.  We used to BUG_ON this case but
      the BUG_+ON has been removed recently by "hugetlb, mempolicy: fix the
      mbind hugetlb migration".
      
      The proper way to handle this is to get the hstate from the migrated
      page and rely on huge_node (resp.  get_vma_policy) do the right thing
      with null VMA.  We are currently falling back to the default mempolicy
      in that case which is in line what THP path is doing here.
      
      Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.czSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reported-by: 's avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      389c8178
    • Michal Hocko's avatar
      hugetlb, mempolicy: fix the mbind hugetlb migration · ebd63723
      Michal Hocko authored
      do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
      pages.  alloc_huge_page_noerr uses alloc_huge_page which is a highlevel
      allocation function which has to take care of reserves, overcommit or
      hugetlb cgroup accounting.  None of that is really required for the page
      migration because the new page is only temporal and either will replace
      the original page or it will be dropped.  This is essentially as for
      other migration call paths and there shouldn't be any reason to handle
      mbind in a special way.
      
      The current implementation is even suboptimal because the migration
      might fail just because the hugetlb cgroup limit is reached, or the
      overcommit is saturated.
      
      Fix this by making mbind like other hugetlb migration paths.  Add a new
      migration helper alloc_huge_page_vma as a wrapper around
      alloc_huge_page_nodemask with additional mempolicy handling.
      
      alloc_huge_page_noerr has no more users and it can go.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebd63723
    • Michal Hocko's avatar
      mm, hugetlb: further simplify hugetlb allocation API · 0c397dae
      Michal Hocko authored
      Hugetlb allocator has several layer of allocation functions depending
      and the purpose of the allocation.  There are two allocators depending
      on whether the page can be allocated from the page allocator or we need
      a contiguous allocator.  This is currently opencoded in
      alloc_fresh_huge_page which is the only path that might allocate giga
      pages which require the later allocator.  Create alloc_fresh_huge_page
      which hides this implementation detail and use it in all callers which
      hardcoded the buddy allocator path (__hugetlb_alloc_buddy_huge_page).
      This shouldn't introduce any funtional change because both migration and
      surplus allocators exlude giga pages explicitly.
      
      While we are at it let's do some renaming.  The current scheme is not
      consistent and overly painfull to read and understand.  Get rid of
      prefix underscores from most functions.  There is no real reason to make
      names longer.
      
      * alloc_fresh_huge_page is the new layer to abstract underlying
        allocator
      * __hugetlb_alloc_buddy_huge_page becomes shorter and neater
        alloc_buddy_huge_page.
      * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
        the new page directly to the pool
      * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
        as it uses alloc_fresh_huge_page now
      * others lose their excessive prefix underscores to make names shorter
      
      [dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
        Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
      Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: 's avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c397dae
    • Michal Hocko's avatar
      mm, hugetlb: get rid of surplus page accounting tricks · 9980d744
      Michal Hocko authored
      alloc_surplus_huge_page increases the pool size and the number of
      surplus pages opportunistically to prevent from races with the pool size
      change.  See commit d1c3fb1f ("hugetlb: introduce
      nr_overcommit_hugepages sysctl") for more details.
      
      The resulting code is unnecessarily hairy, cause code duplication and
      doesn't allow to share the allocation paths.  Moreover pool size changes
      tend to be very seldom so optimizing for them is not really reasonable.
      Simplify the code and allow to allocate a fresh surplus page as long as
      we are under the overcommit limit and then recheck the condition after
      the allocation and drop the new page if the situation has changed.  This
      should provide a reasonable guarantee that an abrupt allocation requests
      will not go way off the limit.
      
      If we consider races with the pool shrinking and enlarging then we
      should be reasonably safe as well.  In the first case we are off by one
      in the worst case and the second case should work OK because the page is
      not yet visible.  We can waste CPU cycles for the allocation but that
      should be acceptable for a relatively rare condition.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      9980d744
    • Michal Hocko's avatar
      mm, hugetlb: do not rely on overcommit limit during migration · ab5ac90a
      Michal Hocko authored
      hugepage migration relies on __alloc_buddy_huge_page to get a new page.
      This has 2 main disadvantages.
      
      1) it doesn't allow to migrate any huge page if the pool is used
         completely which is not an exceptional case as the pool is static and
         unused memory is just wasted.
      
      2) it leads to a weird semantic when migration between two numa nodes
         might increase the pool size of the destination NUMA node while the
         page is in use.  The issue is caused by per NUMA node surplus pages
         tracking (see free_huge_page).
      
      Address both issues by changing the way how we allocate and account
      pages allocated for migration.  Those should temporal by definition.  So
      we mark them that way (we will abuse page flags in the 3rd page) and
      update free_huge_page to free such pages to the page allocator.  Page
      migration path then just transfers the temporal status from the new page
      to the old one which will be freed on the last reference.  The global
      surplus count will never change during this path but we still have to be
      careful when migrating a per-node suprlus page.  This is now handled in
      move_hugetlb_state which is called from the migration path and it copies
      the hugetlb specific page state and fixes up the accounting when needed
      
      Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
      reflect its purpose.  The new allocation routine for the migration path
      is __alloc_migrate_huge_page.
      
      The user visible effect of this patch is that migrated pages are really
      temporal and they travel between NUMA nodes as per the migration
      request:
      
      Before migration
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      After
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      with the previous implementation, both nodes would have nr_hugepages:1
      until the page is freed.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab5ac90a
    • Michal Hocko's avatar
      mm, hugetlb: integrate giga hugetlb more naturally to the allocation path · d9cc948f
      Michal Hocko authored
      Gigantic hugetlb pages were ingrown to the hugetlb code as an alien
      specie with a lot of special casing.  The allocation path is not an
      exception.  Unnecessarily so to be honest.  It is true that the
      underlying allocator is different but that is an implementation detail.
      
      This patch unifies the hugetlb allocation path that a prepares fresh
      pool pages.  alloc_fresh_gigantic_page basically copies
      alloc_fresh_huge_page logic so we can move everything there.  This will
      simplify set_max_huge_pages which doesn't have to care about what kind
      of huge page we allocate.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9cc948f
    • Michal Hocko's avatar
      mm, hugetlb: unify core page allocation accounting and initialization · af0fb9df
      Michal Hocko authored
      Patch series "mm, hugetlb: allocation API and migration improvements"
      
      Motivation:
      
      this is a follow up for [3] for the allocation API and [4] for the
      hugetlb migration.  It wasn't really easy to split those into two
      separate patch series as they share some code.
      
      My primary motivation to touch this code is to make the gigantic pages
      migration working.  The giga pages allocation code is just too fragile
      and hacked into the hugetlb code now.  This series tries to move giga
      pages closer to the first class citizen.  We are not there yet but
      having 5 patches is quite a lot already and it will already make the
      code much easier to follow.  I will come with other changes on top after
      this sees some review.
      
      The first two patches should be trivial to review.  The third patch
      changes the way how we migrate huge pages.  Newly allocated pages are a
      subject of the overcommit check and they participate surplus accounting
      which is quite unfortunate as the changelog explains.  This patch
      doesn't change anything wrt.  giga pages.
      
      Patch #4 removes the surplus accounting hack from
      __alloc_surplus_huge_page.  I hope I didn't miss anything there and a
      deeper review is really due there.
      
      Patch #5 finally unifies allocation paths and giga pages shouldn't be
      any special anymore.  There is also some renaming going on as well.
      
      This patch (of 6):
      
      hugetlb allocator has two entry points to the page allocator
       - alloc_fresh_huge_page_node
       - __hugetlb_alloc_buddy_huge_page
      
      The two differ very subtly in two aspects.  The first one doesn't care
      about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
      prep_new_huge_page is not used because it not only initializes hugetlb
      specific stuff but because it also put_page and releases the page to the
      hugetlb pool which is not what is required in some contexts.  This makes
      things more complicated than necessary.
      
      Simplify things by a) removing the page allocator entry point duplicity
      and only keep __hugetlb_alloc_buddy_huge_page and b) make
      prep_new_huge_page more reusable by removing the put_page which moves
      the page to the allocator pool.  All current callers are updated to call
      put_page explicitly.  Later patches will add new callers which won't
      need it.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      af0fb9df
    • Michal Hocko's avatar
      mm, hugetlb: remove hugepages_treat_as_movable sysctl · d6cb41cc
      Michal Hocko authored
      hugepages_treat_as_movable has been introduced by 396faf03 ("Allow
      huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
      allocations from ZONE_MOVABLE even when hugetlb pages were not
      migrateable.  The purpose of the movable zone was different at the time.
      It aimed at reducing memory fragmentation and hugetlb pages being long
      lived and large werre not contributing to the fragmentation so it was
      acceptable to use the zone back then.
      
      Things have changed though and the primary purpose of the zone became
      migratability guarantee.  If we allow non migrateable hugetlb pages to
      be in ZONE_MOVABLE memory hotplug might fail to offline the memory.
      
      Remove the knob and only rely on hugepage_migration_supported to allow
      movable zones.
      
      Mel said:
      
      : Primarily it was aimed at allowing the hugetlb pool to safely shrink with
      : the ability to grow it again.  The use case was for batched jobs, some of
      : which needed huge pages and others that did not but didn't want the memory
      : useless pinned in the huge pages pool.
      :
      : I suspect that more users rely on THP than hugetlbfs for flexible use of
      : huge pages with fallback options so I think that removing the option
      : should be ok.
      
      Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Reported-by: 's avatarAlexandru Moise <00moses.alexander00@gmail.com>
      Acked-by: 's avatarMel Gorman <mgorman@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6cb41cc
    • Roman Gushchin's avatar
      mm: show total hugetlb memory consumption in /proc/meminfo · fcb2b0c5
      Roman Gushchin authored
      Currently we display some hugepage statistics (total, free, etc) in
      /proc/meminfo, but only for default hugepage size (e.g.  2Mb).
      
      If hugepages of different sizes are used (like 2Mb and 1Gb on x86-64),
      /proc/meminfo output can be confusing, as non-default sized hugepages
      are not reflected at all, and there are no signs that they are existing
      and consuming system memory.
      
      To solve this problem, let's display the total amount of memory,
      consumed by hugetlb pages of all sized (both free and used).  Let's call
      it "Hugetlb", and display size in kB to match generic /proc/meminfo
      style.
      
      For example, (1024 2Mb pages and 2 1Gb pages are pre-allocated):
        $ cat /proc/meminfo
        MemTotal:        8168984 kB
        MemFree:         3789276 kB
        <...>
        CmaFree:               0 kB
        HugePages_Total:    1024
        HugePages_Free:     1024
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:         4194304 kB
        DirectMap4k:       32632 kB
        DirectMap2M:     4161536 kB
        DirectMap1G:     6291456 kB
      
      Also, this patch updates corresponding docs to reflect Hugetlb entry
      meaning and difference between Hugetlb and HugePages_Total * Hugepagesize.
      
      Link: http://lkml.kernel.org/r/20171115231409.12131-1-guro@fb.comSigned-off-by: 's avatarRoman Gushchin <guro@fb.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: 's avatarDavid Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcb2b0c5
  10. 30 Nov, 2017 2 commits
  11. 16 Nov, 2017 1 commit
    • Jérôme Glisse's avatar
      mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse authored
      This patch only affects users of mmu_notifier->invalidate_range callback
      which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
      and it is an optimization for those users.  Everyone else is unaffected
      by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (notify version of *_clear_flush helpers do call the
      mmu_notifier_invalidate_range).  But that notification is not necessary
      in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in depth analysis of why this is fine to do this:
      
      For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
      device use thing like ATS/PASID to get the IOMMU to walk the CPU page
      table to access a process virtual address space).  There is only 2 cases
      when you need to notify those secondary TLB while holding page table
      lock when clearing a pte/pmd:
      
        A) page backing address is free before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious you do not want to take the risk for the device to write
      to a page that might now be used by something completely different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before setting
      the new pte/pmd value then you can break memory model like C11 or C++11 for
      the device.
      
      Consider the following scenario (device use a feature similar to ATS/
      PASID):
      
      Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
      assume they are write protected for COW (other case of B apply too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here because at time N+2 the clear page table entry was not pair with a
      notification to invalidate the secondary TLB, the device see the new value
      for addrB before seing the new value for addrA.  This break total memory
      ordering for the device.
      
      When changing a pte to write protect or to point to a new write protected
      page with same content (KSM) it is ok to delay invalidate_range callback
      to mmu_notifier_invalidate_range_end() outside the page table lock.  This
      is true even if the thread doing page table update is preempted right
      after releasing page table lock before calling
      mmu_notifier_invalidate_range_end
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.comSigned-off-by: 's avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f10851e
  12. 03 Nov, 2017 1 commit
  13. 07 Sep, 2017 3 commits
  14. 15 Aug, 2017 1 commit
    • Aneesh Kumar K.V's avatar
      mm/hugetlb: Allow arch to override and call the weak function · e24a1307
      Aneesh Kumar K.V authored
      When running in guest mode ppc64 supports a different mechanism for hugetlb
      allocation/reservation. The LPAR management application called HMC can
      be used to reserve a set of hugepages and we pass the details of
      reserved pages via device tree to the guest. (more details in
      htab_dt_scan_hugepage_blocks()) . We do the memblock_reserve of the range
      and later in the boot sequence, we add the reserved range to huge_boot_pages.
      
      But to enable 16G hugetlb on baremetal config (when we are not running as guest)
      we want to do memblock reservation during boot. Generic code already does this
      Signed-off-by: 's avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarMichael Ellerman <mpe@ellerman.id.au>
      e24a1307
  15. 10 Aug, 2017 1 commit
  16. 02 Aug, 2017 1 commit
    • Daniel Jordan's avatar
      mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors · 2be7cfed
      Daniel Jordan authored
      Commit 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when
      FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
      errors from follow_hugetlb_page.  After such error, __get_user_pages
      subsequently calls faultin_page on the same VMA and start address that
      follow_hugetlb_page failed on instead of returning the error immediately
      as it should.
      
      In follow_hugetlb_page, when hugetlb_fault returns a value covered under
      VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
      to 0 as __get_user_pages expects in this case, which causes the
      following to happen in __get_user_pages: the "while (nr_pages)" check
      succeeds, we skip the "if (!vma..." check because we got a VMA the last
      time around, we find no page with follow_page_mask, and we call
      faultin_page, which calls hugetlb_fault for the second time.
      
      This issue also slightly changes how __get_user_pages works.  Before, it
      only returned error if it had made no progress (i = 0).  But now,
      follow_hugetlb_page can clobber "i" with an error code since its new
      return path doesn't check for progress.  So if "i" is nonzero before a
      failing call to follow_hugetlb_page, that indication of progress is lost
      and __get_user_pages can return error even if some pages were
      successfully pinned.
      
      To fix this, change follow_hugetlb_page so that it updates nr_pages,
      allowing __get_user_pages to fail immediately and restoring the "error
      only if no progress" behavior to __get_user_pages.
      
      Tested that __get_user_pages returns when expected on error from
      hugetlb_fault in follow_hugetlb_page.
      
      Fixes: 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
      Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.comSigned-off-by: 's avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: 's avatarPunit Agrawal <punit.agrawal@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: <stable@vger.kernel.org>	[4.12.x]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      2be7cfed
  17. 12 Jul, 2017 1 commit
    • Michal Hocko's avatar
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko authored
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  18. 10 Jul, 2017 9 commits
    • Michal Hocko's avatar
      hugetlb: add support for preferred node to alloc_huge_page_nodemask · 3e59fcb0
      Michal Hocko authored
      alloc_huge_page_nodemask tries to allocate from any numa node in the
      allowed node mask starting from lower numa nodes.  This might lead to
      filling up those low NUMA nodes while others are not used.  We can
      reduce this risk by introducing a concept of the preferred node similar
      to what we have in the regular page allocator.  We will start allocating
      from the preferred nid and then iterate over all allowed nodes in the
      zonelist order until we try them all.
      
      This is mimicing the page allocator logic except it operates on per-node
      mempools.  dequeue_huge_page_vma already does this so distill the
      zonelist logic into a more generic dequeue_huge_page_nodemask and use it
      in alloc_huge_page_nodemask.
      
      This will allow us to use proper per numa distance fallback also for
      alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
      can get rid of alloc_huge_page_node helper which doesn't have any user
      anymore.
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e59fcb0
    • Michal Hocko's avatar
      mm, hugetlb: unclutter hugetlb allocation layers · aaf14e40
      Michal Hocko authored
      Patch series "mm, hugetlb: allow proper node fallback dequeue".
      
      While working on a hugetlb migration issue addressed in a separate
      patchset[1] I have noticed that the hugetlb allocations from the
      preallocated pool are quite subotimal.
      
       [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
      
      There is no fallback mechanism implemented and no notion of preferred
      node.  I have tried to work around it but Vlastimil was right to push
      back for a more robust solution.  It seems that such a solution is to
      reuse zonelist approach we use for the page alloctor.
      
      This series has 3 patches.  The first one tries to make hugetlb
      allocation layers more clear.  The second one implements the zonelist
      hugetlb pool allocation and introduces a preferred node semantic which
      is used by the migration callbacks.  The last patch is a clean up.
      
      This patch (of 3):
      
      Hugetlb allocation path for fresh huge pages is unnecessarily complex
      and it mixes different interfaces between layers.
      
      __alloc_buddy_huge_page is the central place to perform a new
      allocation.  It checks for the hugetlb overcommit and then relies on
      __hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
      all good except that __alloc_buddy_huge_page pushes vma and address down
      the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
      two different allocation modes - one for memory policy and other node
      specific (or to make it more obscure node non-specific) requests.
      
      This just screams for a reorganization.
      
      This patch pulls out all the vma specific handling up to
      __alloc_buddy_huge_page_with_mpol where it belongs.
      __alloc_buddy_huge_page will get nodemask argument and
      __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
      page allocator.
      
      In short:
      __alloc_buddy_huge_page_with_mpol - memory policy handling
        __alloc_buddy_huge_page - overcommit handling and accounting
          __hugetlb_alloc_buddy_huge_page - page allocator layer
      
      Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
      is not really needed because the page allocator already handles the
      cpusets update.
      
      Finally __hugetlb_alloc_buddy_huge_page had a special case for node
      specific allocations (when no policy is applied and there is a node
      given).  This has relied on __GFP_THISNODE to not fallback to a different
      node.  alloc_huge_page_node is the only caller which relies on this
      behavior so move the __GFP_THISNODE there.
      
      Not only does this remove quite some code it also should make those
      layers easier to follow and clear wrt responsibilities.
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      aaf14e40
    • Matthew Wilcox's avatar
      mm/hugetlb.c: replace memfmt with string_get_size · c6247f72
      Matthew Wilcox authored
      The hugetlb code has its own function to report human-readable sizes.
      Convert it to use the shared string_get_size() function.  This will lead
      to a minor difference in user visible output (MiB/GiB instead of MB/GB),
      but some would argue that's desirable anyway.
      
      Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.orgSigned-off-by: 's avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6247f72
    • David Rientjes's avatar
      mm, hugetlb: schedule when potentially allocating many hugepages · 69ed779a
      David Rientjes authored
      A few hugetlb allocators loop while calling the page allocator and can
      potentially prevent rescheduling if the page allocator slowpath is not
      utilized.
      
      Conditionally schedule when large numbers of hugepages can be allocated.
      
      Anshuman:
       "Fixes a task which was getting hung while writing like 10000 hugepages
        (16MB on POWER8) into /proc/sys/vm/nr_hugepages."
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.comSigned-off-by: 's avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: 's avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: 's avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      69ed779a
    • Michal Hocko's avatar
      hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite subotpimal for any
      configuration when hugetlb pages are no distributed to all NUMA nodes
      evenly.  Say we have a hotplugable node 4 and spare hugetlb pages are
      node 0
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes migration source and try
      to find a first node which has a page in the preallocated pool first and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
      consumed.
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.orgSigned-off-by: 's avatarMichal Hocko <mhocko@suse.com>
      Acked-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      4db9b2ef
    • Liam R. Howlett's avatar
      mm/hugetlb.c: warn the user when issues arise on boot due to hugepages · d715cf80
      Liam R. Howlett authored
      When the user specifies too many hugepages or an invalid
      default_hugepagesz the communication to the user is implicit in the
      allocation message.  This patch adds a warning when the desired page
      count is not allocated and prints an error when the default_hugepagesz
      is invalid on boot.
      
      During boot hugepages will allocate until there is a fraction of the
      hugepage size left.  That is, we allocate until either the request is
      satisfied or memory for the pages is exhausted.  When memory for the
      pages is exhausted, it will most likely lead to the system failing with
      the OOM manager not finding enough (or anything) to kill (unless you're
      using really big hugepages in the order of 100s of MB or in the GBs).
      The user will most likely see the OOM messages much later in the boot
      sequence than the implicitly stated message.  Worse yet, you may even
      get an OOM for each processor which causes many pages of OOMs on modern
      systems.  Although these messages will be printed earlier than the OOM
      messages, at least giving the user errors and warnings will highlight
      the configuration as an issue.  I'm trying to point the user in the
      right direction by providing a more robust statement of what is failing.
      
      During the sysctl or echo command, the user can check the results much
      easier than if the system hangs during boot and the scenario of having
      nothing to OOM for kernel memory is highly unlikely.
      
      Mike said:
       "Before sending out this patch, I asked Liam off list why he was doing
        it. Was it something he just thought would be useful? Or, was there
        some type of user situation/need. He said that he had been called in
        to assist on several occasions when a system OOMed during boot. In
        almost all of these situations, the user had grossly misconfigured
        huge pages.
      
        DB users want to pre-allocate just the right amount of huge pages, but
        sometimes they can be really off. In such situations, the huge page
        init code just allocates as many huge pages as it can and reports the
        number allocated. There is no indication that it quit allocating
        because it ran out of memory. Of course, a user could compare the
        number in the message to what they requested on the command line to
        determine if they got all the huge pages they requested. The thought
        was that it would be useful to at least flag this situation. That way,
        the user might be able to better relate the huge page allocation
        failure to the OOM.
      
        I'm not sure if the e-mail discussion made it obvious that this is
        something he has seen on several occasions.
      
        I see Michal's point that this will only flag the situation where
        someone configures huge pages very badly. And, a more extensive look
        at the situation of misconfiguring huge pages might be in order. But,
        this has happened on several occasions which led to the creation of
        this patch"
      
      [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
      Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.comSigned-off-by: 's avatarLiam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      d715cf80
    • Naoya Horiguchi's avatar
    • Anshuman Khandual's avatar
      mm: hugetlb: soft-offline: dissolve source hugepage after successful migration · c3114a84
      Anshuman Khandual authored
      Currently hugepage migrated by soft-offline (i.e.  due to correctable
      memory errors) is contained as a hugepage, which means many non-error
      pages in it are unreusable, i.e.  wasted.
      
      This patch solves this issue by dissolving source hugepages into buddy.
      As done in previous patch, PageHWPoison is set only on a head page of
      the error hugepage.  Then in dissoliving we move the PageHWPoison flag
      to the raw error page so that all healthy subpages return back to buddy.
      
      [arnd@arndb.de: fix warnings: replace some macros with inline functions]
        Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: 's avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: 's avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3114a84
    • Naoya Horiguchi's avatar
      mm: hugetlb: prevent reuse of hwpoisoned free hugepages · 243abd5b
      Naoya Horiguchi authored
      Patch series "mm: hwpoison: fixlet for hugetlb migration".
      
      This patchset updates the hwpoison/hugetlb code to address 2 reported
      issues.
      
      One is madvise(MADV_HWPOISON) failure reported by Intel's lkp robot (see
      http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop.) First
      half was already fixed in mainline, and another half about hugetlb cases
      are solved in this series.
      
      Another issue is "narrow-down error affected region into a single 4kB
      page instead of a whole hugetlb page" issue, which was tried by Anshuman
      (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
      and I updated it to apply it more widely.
      
      This patch (of 9):
      
      We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoison hugepages
      as we did before.  So current dequeue_huge_page_node() doesn't work as
      intended because it still uses is_migrate_isolate_page() for this check.
      This patch fixes it with PageHWPoison flag.
      
      Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: 's avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      243abd5b
  19. 06 Jul, 2017 1 commit