    mm/slub: optimize alloc/free fastpath by removing preemption on/off · 9aabf810
    Joonsoo Kim authored
    
    
    We had to insert a preempt enable/disable pair in the fastpath a while
    ago in order to guarantee that tid and kmem_cache_cpu are retrieved on
    the same cpu.  This is only a problem under CONFIG_PREEMPT, where the
    scheduler can migrate the process to another cpu between the two reads.
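
    For reference, the guarded fetch looked roughly like this (a sketch of
    the pre-patch pattern in mm/slub.c, not the verbatim code):

     struct kmem_cache_cpu *c;
     unsigned long tid;

     /*
      * Preemption is disabled so that tid and the per-cpu slab
      * pointer are guaranteed to be read on the same cpu.
      */
     preempt_disable();
     c = this_cpu_ptr(s->cpu_slab);
     tid = c->tid;
     preempt_enable();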
    
    Now I have a solution that removes the preempt enable/disable from the
    fastpath.  If tid matches kmem_cache_cpu's tid after the two are
    retrieved by separate this_cpu operations, they must have been
    retrieved on the same cpu.  If they don't match, we simply retry, as
    shown in the sketch below.
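
    In code, the fetch turns into a retry loop along these lines (again a
    sketch of the idea; the IS_ENABLED(CONFIG_PREEMPT) test lets the
    recheck compile away entirely on non-preemptible kernels):

     struct kmem_cache_cpu *c;
     unsigned long tid;

     /*
      * Read tid and the per-cpu pointer with separate this_cpu
      * operations.  If the tids agree, both reads happened on the
      * same cpu; otherwise we were migrated in between, so retry.
      */
     do {
             tid = this_cpu_read(s->cpu_slab->tid);
             c = raw_cpu_ptr(s->cpu_slab);
     } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));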
    
    With this guarantee, the preemption enable/disable isn't needed at
    all, even with CONFIG_PREEMPT, so this patch removes it.
    
    I saw roughly a 5% win in a fast-path loop over kmem_cache_alloc/free
    with CONFIG_PREEMPT enabled.  (14.821 ns -> 14.049 ns)
    
    Below are the results of Christoph's slab_test, reported by Jesper
    Dangaard Brouer.
    
    * Before
    
     Single thread testing
     =====================
     1. Kmalloc: Repeatedly allocate then free test
     10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
     10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
     10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
     10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
     10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
     10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
     10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
     10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
     10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
     10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
     10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
     10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
     2. Kmalloc: alloc/free test
     10000 times kmalloc(8)/kfree -> 68 cycles
     10000 times kmalloc(16)/kfree -> 68 cycles
     10000 times kmalloc(32)/kfree -> 69 cycles
     10000 times kmalloc(64)/kfree -> 68 cycles
     10000 times kmalloc(128)/kfree -> 68 cycles
     10000 times kmalloc(256)/kfree -> 68 cycles
     10000 times kmalloc(512)/kfree -> 74 cycles
     10000 times kmalloc(1024)/kfree -> 75 cycles
     10000 times kmalloc(2048)/kfree -> 74 cycles
     10000 times kmalloc(4096)/kfree -> 74 cycles
     10000 times kmalloc(8192)/kfree -> 75 cycles
     10000 times kmalloc(16384)/kfree -> 510 cycles
    
    * After
    
     Single thread testing
     =====================
     1. Kmalloc: Repeatedly allocate then free test
     10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
     10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
     10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
     10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
     10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
     10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
     10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
     10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
     10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
     10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
     10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
     10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
     2. Kmalloc: alloc/free test
     10000 times kmalloc(8)/kfree -> 65 cycles
     10000 times kmalloc(16)/kfree -> 66 cycles
     10000 times kmalloc(32)/kfree -> 65 cycles
     10000 times kmalloc(64)/kfree -> 66 cycles
     10000 times kmalloc(128)/kfree -> 66 cycles
     10000 times kmalloc(256)/kfree -> 71 cycles
     10000 times kmalloc(512)/kfree -> 72 cycles
     10000 times kmalloc(1024)/kfree -> 71 cycles
     10000 times kmalloc(2048)/kfree -> 71 cycles
     10000 times kmalloc(4096)/kfree -> 71 cycles
     10000 times kmalloc(8192)/kfree -> 65 cycles
     10000 times kmalloc(16384)/kfree -> 511 cycles
    
    Most of the results are better than before.
    
    Note that this change slightly worsens performance with
    !CONFIG_PREEMPT, by roughly 0.3%.  Implementing each case separately
    would help performance, but since the difference is so marginal, I
    didn't do that.  Sharing the same code for all cases also helps
    maintenance.
    
    Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Acked-by: Christoph Lameter <cl@linux.com>
    Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>