1. 12 Feb, 2015 1 commit
    • Vlastimil Babka's avatar
      mm: microoptimize zonelist operations · 05891fb0
      Vlastimil Babka authored
      next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer
      via extra parameter.  Since the latter can be trivially obtained by
      dereferencing the former, the overhead of the extra parameter is
      This patch thus removes the zone parameter from next_zones_zonelist().
      Both callers happen to be in the same header file, so it's simple to add
      the zoneref dereference inline.  We save some bytes of code size.
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
      function                                     old     new   delta
      nr_free_zone_pages                           129     115     -14
      __alloc_pages_nodemask                      2300    2285     -15
      get_page_from_freelist                      2652    2576     -76
      add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
      function                                     old     new   delta
      try_to_compact_pages                         569     579     +10
      Signed-off-by: 's avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  2. 09 Oct, 2013 2 commits
    • Peter Zijlstra's avatar
      mm: numa: Change page last {nid,pid} into {cpu,pid} · 90572890
      Peter Zijlstra authored
      Change the per page last fault tracking to use cpu,pid instead of
      nid,pid. This will allow us to try and lookup the alternate task more
      easily. Note that even though it is the cpu that is store in the page
      flags that the mpol_misplaced decision is still based on the node.
      Signed-off-by: 's avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: 's avatarRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
      [ Fixed build failure on 32-bit systems. ]
      Signed-off-by: 's avatarIngo Molnar <mingo@kernel.org>
    • Mel Gorman's avatar
      sched/numa: Set preferred NUMA node based on number of private faults · b795854b
      Mel Gorman authored
      Ideally it would be possible to distinguish between NUMA hinting faults that
      are private to a task and those that are shared. If treated identically
      there is a risk that shared pages bounce between nodes depending on
      the order they are referenced by tasks. Ultimately what is desirable is
      that task private pages remain local to the task while shared pages are
      interleaved between sharing tasks running on different nodes to give good
      average performance. This is further complicated by THP as even
      applications that partition their data may not be partitioning on a huge
      page boundary.
      To start with, this patch assumes that multi-threaded or multi-process
      applications partition their data and that in general the private accesses
      are more important for cpu->memory locality in the general case. Also,
      no new infrastructure is required to treat private pages properly but
      interleaving for shared pages requires additional infrastructure.
      To detect private accesses the pid of the last accessing task is required
      but the storage requirements are a high. This patch borrows heavily from
      Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
      to encode some bits from the last accessing task in the page flags as
      well as the node information. Collisions will occur but it is better than
      just depending on the node information. Node information is then used to
      determine if a page needs to migrate. The PID information is used to detect
      private/shared accesses. The preferred NUMA node is selected based on where
      the maximum number of approximately private faults were measured. Shared
      faults are not taken into consideration for a few reasons.
      First, if there are many tasks sharing the page then they'll all move
      towards the same node. The node will be compute overloaded and then
      scheduled away later only to bounce back again. Alternatively the shared
      tasks would just bounce around nodes because the fault information is
      effectively noise. Either way accounting for shared faults the same as
      private faults can result in lower performance overall.
      The second reason is based on a hypothetical workload that has a small
      number of very important, heavily accessed private pages but a large shared
      array. The shared array would dominate the number of faults and be selected
      as a preferred node even though it's the wrong decision.
      The third reason is that multiple threads in a process will race each
      other to fault the shared page making the fault information unreliable.
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      [ Fix complication error when !NUMA_BALANCING. ]
      Reviewed-by: 's avatarRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: 's avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.deSigned-off-by: 's avatarIngo Molnar <mingo@kernel.org>
  3. 24 Feb, 2013 2 commits
    • Mel Gorman's avatar
      mm: rename page struct field helpers · 22b751c3
      Mel Gorman authored
      The function names page_xchg_last_nid(), page_last_nid() and
      reset_page_last_nid() were judged to be inconsistent so rename them to a
      struct_field_op style pattern.  As it looked jarring to have
      reset_page_mapcount() and page_nid_reset_last() beside each other in
      memmap_init_zone(), this patch also renames reset_page_mapcount() to
      page_mapcount_reset().  There are others like init_page_count() but as
      it is used throughout the arch code a rename would likely cause more
      conflicts than it is worth.
      [akpm@linux-foundation.org: fix zcache]
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Suggested-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Mel Gorman's avatar
      mm: uninline page_xchg_last_nid() · 4468b8f1
      Mel Gorman authored
      Andrew Morton pointed out that page_xchg_last_nid() and
      reset_page_last_nid() were "getting nuttily large" and asked that it be
      reset_page_last_nid() is on the page free path and it would be
      unfortunate to make that path more expensive than it needs to be.  Due
      to the internal use of page_xchg_last_nid() it is already too expensive
      but fortunately, it should also be impossible for the page->flags to be
      updated in parallel when we call reset_page_last_nid().  Instead of
      unlining the function, it uses a simplier implementation that assumes no
      parallel updates and should now be sufficiently short for inlining.
      page_xchg_last_nid() is called in paths that are already quite expensive
      (splitting huge page, fault handling, migration) and it is reasonable to
      uninline.  There was not really a good place to place the function but
      mm/mmzone.c was the closest fit IMO.
      This patch saved 128 bytes of text in the vmlinux file for the kernel
      configuration I used for testing automatic NUMA balancing.
      Signed-off-by: 's avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  4. 16 Nov, 2012 1 commit
    • Hugh Dickins's avatar
      memcg: fix hotplugged memory zone oops · bea8c150
      Hugh Dickins authored
      When MEMCG is configured on (even when it's disabled by boot option),
      when adding or removing a page to/from its lru list, the zone pointer
      used for stats updates is nowadays taken from the struct lruvec.  (On
      many configurations, calculating zone from page is slower.)
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
      Ah, there was one exceptionr.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding lruvec.
      Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against such
      an oops (the lru lists in danger could only be empty), but we're better
      proofed against future changes this way.
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
      Reported-by: 's avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: 's avatarHugh Dickins <hughd@google.com>
      Acked-by: 's avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: 's avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  5. 01 Aug, 2012 1 commit
    • Andrew Morton's avatar
      memcg: rename config variables · c255a458
      Andrew Morton authored
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: 's avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 29 May, 2012 1 commit
  7. 31 Oct, 2011 1 commit
  8. 14 Jan, 2011 1 commit
    • Mel Gorman's avatar
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman authored
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a threshold is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      Unfortately, as reported by Shaohua Li this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      spent cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      vanilla                      11.6615%
      disable-threshold            0.2584%
      David said:
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: 's avatarMel Gorman <mel@csn.ul.ie>
      Reported-by: 's avatarShaohua Li <shaohua.li@intel.com>
      Reviewed-by: 's avatarChristoph Lameter <cl@linux.com>
      Tested-by: 's avatarNicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[, 2.6.36.x]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  9. 10 Sep, 2010 1 commit
    • Christoph Lameter's avatar
      mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory... · aa454840
      Christoph Lameter authored
      mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
      Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
      cheaper than scanning a number of lists.  To avoid synchronization
      overhead, counter deltas are maintained on a per-cpu basis and drained
      both periodically and when the delta is above a threshold.  On large CPU
      systems, the difference between the estimated and real value of
      NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than
      number of real free page in buddy, the VM can allocate pages below min
      watermark, at worst reducing the real number of pages to zero.  Even if
      the OOM killer kills some victim for freeing memory, it may not free
      memory if the exit path requires a new page resulting in livelock.
      This patch introduces a zone_page_state_snapshot() function (courtesy of
      Christoph) that takes a slightly more accurate view of an arbitrary vmstat
      counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
      the watermark being accidentally broken.  The estimate is not perfect and
      may result in cache line bounces but is expected to be lighter than the
      IPI calls necessary to continually drain the per-cpu counters while kswapd
      is awake.
      Signed-off-by: 's avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: 's avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  10. 18 May, 2009 1 commit
    • Mel Gorman's avatar
      [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 · eb33575c
      Mel Gorman authored
      pfn_valid() is meant to be able to tell if a given PFN has valid memmap
      associated with it or not. In FLATMEM, it is expected that holes always
      have valid memmap as long as there is valid PFNs either side of the hole.
      In SPARSEMEM, it is assumed that a valid section has a memmap for the
      entire section.
      However, ARM and maybe other embedded architectures in the future free
      memmap backing holes to save memory on the assumption the memmap is never
      used. The page_zone linkages are then broken even though pfn_valid()
      returns true. A walker of the full memmap must then do this additional
      check to ensure the memmap they are looking at is sane by making sure the
      zone and PFN linkages are still valid. This is expensive, but walkers of
      the full memmap are extremely rare.
      This was caught before for FLATMEM and hacked around but it hits again for
      SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
      are totally screwed. This looks like a hatchet job but the reality is that
      any clean solution would end up consumning all the memory saved by punching
      these unexpected holes in the memmap. For example, we tried marking the
      memmap within the section invalid but the section size exceeds the size of
      the hole in most cases so pfn_valid() starts returning false where valid
      memmap exists. Shrinking the size of the section would increase memory
      consumption offsetting the gains.
      This patch identifies when an architecture is punching unexpected holes
      in the memmap that the memory model cannot automatically detect and sets
      ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
      which is the model sub-architecture this has been reported on but may expand
      later. When set, walkers of the full memmap must call memmap_valid_within()
      for each PFN and passing in what it expects the page and zone to be for
      that PFN. If it finds the linkages to be broken, it assumes the memmap is
      invalid for that PFN.
      Signed-off-by: 's avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: 's avatarRussell King <rmk+kernel@arm.linux.org.uk>
  11. 13 Sep, 2008 1 commit
    • Mel Gorman's avatar
      mm: mark the correct zone as full when scanning zonelists · 5bead2a0
      Mel Gorman authored
      The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
      scanning zonelists to keep track of where in the zonelist it is.  The
      zoneref that is returned corresponds to the the next zone that is to be
      scanned, not the current one.  It was intended to be treated as an opaque
      When the page allocator is scanning a zonelist, it marks elements in the
      zonelist corresponding to zones that are temporarily full.  As the
      zonelist is being updated, it uses the cursor here;
        if (NUMA_BUILD)
              zlc_mark_zone_full(zonelist, z);
      This is intended to prevent rescanning in the near future but the zoneref
      cursor does not correspond to the zone that has been found to be full.
      This is an easy misunderstanding to make so this patch corrects the
      problem by changing zoneref cursor to be the current zone being scanned
      instead of the next one.
      Signed-off-by: 's avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.26.x]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  12. 28 Apr, 2008 1 commit
    • Mel Gorman's avatar
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman authored
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: 's avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: 's avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: 's avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 07 Dec, 2006 1 commit
  14. 10 Jul, 2006 1 commit
  15. 30 Jun, 2006 1 commit
  16. 27 Mar, 2006 1 commit