1. 14 Apr, 2015 2 commits
  2. 25 Mar, 2015 1 commit
    • Mark Rutland's avatar
      mm/slub: fix lockups on PREEMPT && !SMP kernels · 859b7a0e
      Mark Rutland authored
      Commit 9aabf810 ("mm/slub: optimize alloc/free fastpath by removing
      preemption on/off") introduced an occasional hang for kernels built with
      CONFIG_PREEMPT && !CONFIG_SMP.
      
      The problem is the following loop the patch introduced to
      slab_alloc_node and slab_free:
      
          do {
              tid = this_cpu_read(s->cpu_slab->tid);
              c = raw_cpu_ptr(s->cpu_slab);
          } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
      
      GCC 4.9 has been observed to hoist the load of c and c->tid above the
      loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
      constant and does not force a reload).  On arm64 the generated assembly
      looks like:
      
               ldr     x4, [x0,#8]
        loop:
               ldr     x1, [x0,#8]
               cmp     x1, x4
               b.ne    loop
      
      If the thread is preempted between the load of c->tid (into x1) and tid
      (into x4), and an allocation or free occurs in another thread (bumping
      the cpu_slab's tid), the thread will be stuck in the loop until
      s->cpu_slab->tid wraps, which may be forever in the absence of
      allocations/frees on the same CPU.
      
      This patch changes the loop condition to access c->tid with READ_ONCE.
      This ensures that the value is reloaded even when the compiler would
      otherwise assume it could cache the value, and also ensures that the
      load will not be torn.
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      859b7a0e
  3. 14 Feb, 2015 4 commits
    • Andrey Ryabinin's avatar
      mm: slub: add kernel address sanitizer support for slub allocator · 0316bec2
      Andrey Ryabinin authored
      With this patch kasan will be able to catch bugs in memory allocated by
      slub.  Initially all objects in newly allocated slab page, marked as
      redzone.  Later, when allocation of slub object happens, requested by
      caller number of bytes marked as accessible, and the rest of the object
      (including slub's metadata) marked as redzone (inaccessible).
      
      We also mark object as accessible if ksize was called for this object.
      There is some places in kernel where ksize function is called to inquire
      size of really allocated area.  Such callers could validly access whole
      allocated memory, so it should be marked as accessible.
      
      Code in slub.c and slab_common.c files could validly access to object's
      metadata, so instrumentation for this files are disabled.
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Signed-off-by: default avatarDmitry Chernenkov <dmitryc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: default avatarAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0316bec2
    • Andrey Ryabinin's avatar
      mm: slub: introduce metadata_access_enable()/metadata_access_disable() · a79316c6
      Andrey Ryabinin authored
      It's ok for slub to access memory that marked by kasan as inaccessible
      (object's metadata).  Kasan shouldn't print report in that case because
      these accesses are valid.  Disabling instrumentation of slub.c code is not
      enough to achieve this because slub passes pointer to object's metadata
      into external functions like memchr_inv().
      
      We don't want to disable instrumentation for memchr_inv() because this is
      quite generic function, and we don't want to miss bugs.
      
      metadata_access_enable/metadata_access_disable used to tell KASan where
      accesses to metadata starts/end, so we could temporarily disable KASan
      reports.
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a79316c6
    • Andrey Ryabinin's avatar
      mm: slub: share object_err function · 75c66def
      Andrey Ryabinin authored
      Remove static and add function declarations to linux/slub_def.h so it
      could be used by kernel address sanitizer.
      Signed-off-by: default avatarAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75c66def
    • Tejun Heo's avatar
      slub: use %*pb[l] to print bitmaps including cpumasks and nodemasks · 5024c1d7
      Tejun Heo authored
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      
      * This is an equivalent conversion but the whole function should be
        converted to use scnprinf famiily of functions rather than
        performing custom output length predictions in multiple places.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5024c1d7
  4. 13 Feb, 2015 5 commits
    • Vladimir Davydov's avatar
      slub: make dead caches discard free slabs immediately · d6e0b7fa
      Vladimir Davydov authored
      To speed up further allocations SLUB may store empty slabs in per cpu/node
      partial lists instead of freeing them immediately.  This prevents per
      memcg caches destruction, because kmem caches created for a memory cgroup
      are only destroyed after the last page charged to the cgroup is freed.
      
      To fix this issue, this patch resurrects approach first proposed in [1].
      It forbids SLUB to cache empty slabs after the memory cgroup that the
      cache belongs to was destroyed.  It is achieved by setting kmem_cache's
      cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
      that it would drop frozen empty slabs immediately if cpu_partial = 0.
      
      The runtime overhead is minimal.  From all the hot functions, we only
      touch relatively cold put_cpu_partial(): we make it call
      unfreeze_partials() after freezing a slab that belongs to an offline
      memory cgroup.  Since slab freezing exists to avoid moving slabs from/to a
      partial list on free/alloc, and there can't be allocations from dead
      caches, it shouldn't cause any overhead.  We do have to disable preemption
      for put_cpu_partial() to achieve that though.
      
      The original patch was accepted well and even merged to the mm tree.
      However, I decided to withdraw it due to changes happening to the memcg
      core at that time.  I had an idea of introducing per-memcg shrinkers for
      kmem caches, but now, as memcg has finally settled down, I do not see it
      as an option, because SLUB shrinker would be too costly to call since SLUB
      does not keep free slabs on a separate list.  Besides, we currently do not
      even call per-memcg shrinkers for offline memcgs.  Overall, it would
      introduce much more complexity to both SLUB and memcg than this small
      patch.
      
      Regarding to SLAB, there's no problem with it, because it shrinks
      per-cpu/node caches periodically.  Thanks to list_lru reparenting, we no
      longer keep entries for offline cgroups in per-memcg arrays (such as
      memcg_cache_params->memcg_caches), so we do not have to bother if a
      per-memcg cache will be shrunk a bit later than it could be.
      
      [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6e0b7fa
    • Vladimir Davydov's avatar
      slub: fix kmem_cache_shrink return value · ce3712d7
      Vladimir Davydov authored
      It is supposed to return 0 if the cache has no remaining objects and 1
      otherwise, while currently it always returns 0.  Fix it.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce3712d7
    • Vladimir Davydov's avatar
      slub: never fail to shrink cache · 832f37f5
      Vladimir Davydov authored
      SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but
      also tries to rearrange the partial lists to place slabs filled up most to
      the head to cope with fragmentation.  To achieve that, it allocates a
      temporary array of lists used to sort slabs by the number of objects in
      use.  If the allocation fails, the whole procedure is aborted.
      
      This is unacceptable for the kernel memory accounting extension of the
      memory cgroup, where we want to make sure that kmem_cache_shrink()
      successfully discarded empty slabs.  Although the allocation failure is
      utterly unlikely with the current page allocator implementation, which
      retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not
      to rely on that.
      
      This patch therefore makes __kmem_cache_shrink() allocate the array on
      stack instead of calling kmalloc, which may fail.  The array size is
      chosen to be equal to 32, because most SLUB caches store not more than 32
      objects per slab page.  Slab pages with <= 32 free objects are sorted
      using the array by the number of objects in use and promoted to the head
      of the partial list, while slab pages with > 32 free objects are left in
      the end of the list without any ordering imposed on them.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      832f37f5
    • Vladimir Davydov's avatar
      slab: link memcg caches of the same kind into a list · 426589f5
      Vladimir Davydov authored
      Sometimes, we need to iterate over all memcg copies of a particular root
      kmem cache.  Currently, we use memcg_cache_params->memcg_caches array for
      that, because it contains all existing memcg caches.
      
      However, it's a bad practice to keep all caches, including those that
      belong to offline cgroups, in this array, because it will be growing
      beyond any bounds then.  I'm going to wipe away dead caches from it to
      save space.  To still be able to perform iterations over all memcg caches
      of the same kind, let us link them into a list.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      426589f5
    • Vladimir Davydov's avatar
      slab: embed memcg_cache_params to kmem_cache · f7ce3190
      Vladimir Davydov authored
      Currently, kmem_cache stores a pointer to struct memcg_cache_params
      instead of embedding it.  The rationale is to save memory when kmem
      accounting is disabled.  However, the memcg_cache_params has shrivelled
      drastically since it was first introduced:
      
      * Initially:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct kmem_cache *memcg_caches[0];
      		struct {
      			struct mem_cgroup *memcg;
      			struct list_head list;
      			struct kmem_cache *root_cache;
      			bool dead;
      			atomic_t nr_pages;
      			struct work_struct destroy;
      		};
      	};
      };
      
      * Now:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct {
      			struct rcu_head rcu_head;
      			struct kmem_cache *memcg_caches[0];
      		};
      		struct {
      			struct mem_cgroup *memcg;
      			struct kmem_cache *root_cache;
      		};
      	};
      };
      
      So the memory saving does not seem to be a clear win anymore.
      
      OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
      it results in touching one more cache line on kmem alloc/free hot paths.
      Besides, it makes linking kmem caches in a list chained by a field of
      struct memcg_cache_params really painful due to a level of indirection,
      while I want to make them linked in the following patch.  That said, let
      us embed it.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7ce3190
  5. 10 Feb, 2015 2 commits
    • Kim Phillips's avatar
    • Joonsoo Kim's avatar
      mm/slub: optimize alloc/free fastpath by removing preemption on/off · 9aabf810
      Joonsoo Kim authored
      We had to insert a preempt enable/disable in the fastpath a while ago in
      order to guarantee that tid and kmem_cache_cpu are retrieved on the same
      cpu.  It is the problem only for CONFIG_PREEMPT in which scheduler can
      move the process to other cpu during retrieving data.
      
      Now, I reach the solution to remove preempt enable/disable in the
      fastpath.  If tid is matched with kmem_cache_cpu's tid after tid and
      kmem_cache_cpu are retrieved by separate this_cpu operation, it means
      that they are retrieved on the same cpu.  If not matched, we just have
      to retry it.
      
      With this guarantee, preemption enable/disable isn't need at all even if
      CONFIG_PREEMPT, so this patch removes it.
      
      I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free in
      CONFIG_PREEMPT.  (14.821 ns -> 14.049 ns)
      
      Below is the result of Christoph's slab_test reported by Jesper Dangaard
      Brouer.
      
      * Before
      
       Single thread testing
       =====================
       1. Kmalloc: Repeatedly allocate then free test
       10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
       10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
       10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
       10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
       10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
       10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
       10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
       10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
       10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
       10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
       10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
       10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
       2. Kmalloc: alloc/free test
       10000 times kmalloc(8)/kfree -> 68 cycles
       10000 times kmalloc(16)/kfree -> 68 cycles
       10000 times kmalloc(32)/kfree -> 69 cycles
       10000 times kmalloc(64)/kfree -> 68 cycles
       10000 times kmalloc(128)/kfree -> 68 cycles
       10000 times kmalloc(256)/kfree -> 68 cycles
       10000 times kmalloc(512)/kfree -> 74 cycles
       10000 times kmalloc(1024)/kfree -> 75 cycles
       10000 times kmalloc(2048)/kfree -> 74 cycles
       10000 times kmalloc(4096)/kfree -> 74 cycles
       10000 times kmalloc(8192)/kfree -> 75 cycles
       10000 times kmalloc(16384)/kfree -> 510 cycles
      
      * After
      
       Single thread testing
       =====================
       1. Kmalloc: Repeatedly allocate then free test
       10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
       10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
       10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
       10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
       10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
       10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
       10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
       10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
       10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
       10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
       10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
       10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
       2. Kmalloc: alloc/free test
       10000 times kmalloc(8)/kfree -> 65 cycles
       10000 times kmalloc(16)/kfree -> 66 cycles
       10000 times kmalloc(32)/kfree -> 65 cycles
       10000 times kmalloc(64)/kfree -> 66 cycles
       10000 times kmalloc(128)/kfree -> 66 cycles
       10000 times kmalloc(256)/kfree -> 71 cycles
       10000 times kmalloc(512)/kfree -> 72 cycles
       10000 times kmalloc(1024)/kfree -> 71 cycles
       10000 times kmalloc(2048)/kfree -> 71 cycles
       10000 times kmalloc(4096)/kfree -> 71 cycles
       10000 times kmalloc(8192)/kfree -> 65 cycles
       10000 times kmalloc(16384)/kfree -> 511 cycles
      
      Most of the results are better than before.
      
      Note that this change slightly worses performance in !CONFIG_PREEMPT,
      roughly 0.3%.  Implementing each case separately would help performance,
      but, since it's so marginal, I didn't do that.  This would help
      maintanance since we have same code for all cases.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Tested-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9aabf810
  6. 13 Dec, 2014 2 commits
    • Vladimir Davydov's avatar
      slub: fix cpuset check in get_any_partial · dee2f8aa
      Vladimir Davydov authored
      If we fail to allocate from the current node's stock, we look for free
      objects on other nodes before calling the page allocator (see
      get_any_partial).  While checking other nodes we respect cpuset
      constraints by calling cpuset_zone_allowed.  We enforce hardwall check.
      As a result, we will fallback to the page allocator even if there are some
      pages cached on other nodes, but the current cpuset doesn't have them set.
       However, the page allocator uses softwall check for kernel allocations,
      so it may allocate from one of the other nodes in this case.
      
      Therefore we should use softwall cpuset check in get_any_partial to
      conform with the cpuset check in the page allocator.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarZefan Li <lizefan@huawei.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dee2f8aa
    • Vladimir Davydov's avatar
      memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov authored
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound as a high price for code readability though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8135be5a
  7. 11 Dec, 2014 3 commits
  8. 27 Oct, 2014 1 commit
    • Vladimir Davydov's avatar
      cpuset: simplify cpuset_node_allowed API · 344736f2
      Vladimir Davydov authored
      Current cpuset API for checking if a zone/node is allowed to allocate
      from looks rather awkward. We have hardwall and softwall versions of
      cpuset_node_allowed with the softwall version doing literally the same
      as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
      If it isn't, the softwall version may check the given node against the
      enclosing hardwall cpuset, which it needs to take the callback lock to
      do.
      
      Such a distinction was introduced by commit 02a0e53d ("cpuset:
      rework cpuset_zone_allowed api"). Before, we had the only version with
      the __GFP_HARDWALL flag determining its behavior. The purpose of the
      commit was to avoid sleep-in-atomic bugs when someone would mistakenly
      call the function without the __GFP_HARDWALL flag for an atomic
      allocation. The suffixes introduced were intended to make the callers
      think before using the function.
      
      However, since the callback lock was converted from mutex to spinlock by
      the previous patch, the softwall check function cannot sleep, and these
      precautions are no longer necessary.
      
      So let's simplify the API back to the single check.
      Suggested-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarZefan Li <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      344736f2
  9. 10 Oct, 2014 3 commits
  10. 07 Aug, 2014 8 commits
  11. 03 Jul, 2014 1 commit
  12. 06 Jun, 2014 1 commit
  13. 04 Jun, 2014 7 commits
    • Christoph Lameter's avatar
      mm: replace __get_cpu_var uses with this_cpu_ptr · 7c8e0181
      Christoph Lameter authored
      Replace places where __get_cpu_var() is used for an address calculation
      with this_cpu_ptr().
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c8e0181
    • Vladimir Davydov's avatar
      memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab · c67a8a68
      Vladimir Davydov authored
      Currently we have two pairs of kmemcg-related functions that are called on
      slab alloc/free.  The first is memcg_{bind,release}_pages that count the
      total number of pages allocated on a kmem cache.  The second is
      memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
      counter.  Let's just merge them to keep the code clean.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c67a8a68
    • Vladimir Davydov's avatar
      slab: get_online_mems for kmem_cache_{create,destroy,shrink} · 03afc0e2
      Vladimir Davydov authored
      When we create a sl[au]b cache, we allocate kmem_cache_node structures
      for each online NUMA node.  To handle nodes taken online/offline, we
      register memory hotplug notifier and allocate/free kmem_cache_node
      corresponding to the node that changes its state for each kmem cache.
      
      To synchronize between the two paths we hold the slab_mutex during both
      the cache creationg/destruction path and while tuning per-node parts of
      kmem caches in memory hotplug handler, but that's not quite right,
      because it does not guarantee that a newly created cache will have all
      kmem_cache_nodes initialized in case it races with memory hotplug.  For
      instance, in case of slub:
      
          CPU0                            CPU1
          ----                            ----
          kmem_cache_create:              online_pages:
           __kmem_cache_create:            slab_memory_callback:
                                            slab_mem_going_online_callback:
                                             lock slab_mutex
                                             for each slab_caches list entry
                                                 allocate kmem_cache node
                                             unlock slab_mutex
            lock slab_mutex
            init_kmem_cache_nodes:
             for_each_node_state(node, N_NORMAL_MEMORY)
                 allocate kmem_cache node
            add kmem_cache to slab_caches list
            unlock slab_mutex
                                          online_pages (continued):
                                           node_states_set_node
      
      As a result we'll get a kmem cache with not all kmem_cache_nodes
      allocated.
      
      To avoid issues like that we should hold get/put_online_mems() during
      the whole kmem cache creation/destruction/shrink paths, just like we
      deal with cpu hotplug.  This patch does the trick.
      
      Note, that after it's applied, there is no need in taking the slab_mutex
      for kmem_cache_shrink any more, so it is removed from there.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03afc0e2
    • Vladimir Davydov's avatar
      mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov authored
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves for the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
        myself, because it used an rw semaphore for get/put_online_mems,
        making them dead lock prune.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing a code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
    • Vladimir Davydov's avatar
      mm: get rid of __GFP_KMEMCG · 52383431
      Vladimir Davydov authored
      Currently to allocate a page that should be charged to kmemcg (e.g.
      threadinfo), we pass __GFP_KMEMCG flag to the page allocator.  The page
      allocated is then to be freed by free_memcg_kmem_pages.  Apart from
      looking asymmetrical, this also requires intrusion to the general
      allocation path.  So let's introduce separate functions that will
      alloc/free pages charged to kmemcg.
      
      The new functions are called alloc_kmem_pages and free_kmem_pages.  They
      should be used when the caller actually would like to use kmalloc, but
      has to fall back to the page allocator for the allocation is large.
      They only differ from alloc_pages and free_pages in that besides
      allocating or freeing pages they also charge them to the kmem resource
      counter of the current memory cgroup.
      
      [sfr@canb.auug.org.au: export kmalloc_order() to modules]
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52383431
    • Vladimir Davydov's avatar
      sl[au]b: charge slabs to kmemcg explicitly · 5dfb4175
      Vladimir Davydov authored
      We have only a few places where we actually want to charge kmem so
      instead of intruding into the general page allocation path with
      __GFP_KMEMCG it's better to explictly charge kmem there.  All kmem
      charges will be easier to follow that way.
      
      This is a step towards removing __GFP_KMEMCG.  It removes __GFP_KMEMCG
      from memcg caches' allocflags.  Instead it makes slab allocation path
      call memcg_charge_kmem directly getting memcg to charge from the cache's
      memcg params.
      
      This also eliminates any possibility of misaccounting an allocation
      going from one memcg's cache to another memcg, because now we always
      charge slabs against the memcg the cache belongs to.  That's why this
      patch removes the big comment to memcg_kmem_get_cache.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5dfb4175
    • Dave Hansen's avatar
      mm: slub: fix ALLOC_SLOWPATH stat · 8eae1492
      Dave Hansen authored
      There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
      got bumped in that exit path.  Now there are two, and a bunch of gotos.
      ALLOC_SLOWPATH can now get set more than once during a single call to
      __slab_alloc() which is pretty bogus.  Here's the sequence:
      
      1. Enter __slab_alloc(), fall through all the way to the
         stat(s, ALLOC_SLOWPATH);
      2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
         new_slab (goto #1)
      3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
         (goto #2)
      4. Fall through in the same path we did before all the way to
         stat(s, ALLOC_SLOWPATH)
      5. bump ALLOC_REFILL stat, then return
      
      Doing this is obviously bogus.  It keeps us from being able to
      accurately compare ALLOC_SLOWPATH vs.  ALLOC_FASTPATH.  It also means
      that the total number of allocs always exceeds the total number of
      frees.
      
      This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
      place that __slab_alloc() is.  This makes it much less likely that
      ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
      __slab_alloc().
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8eae1492