  1. 03 Jul, 2018 1 commit
  2. 03 Jun, 2018 1 commit
    • Daniel Borkmann
      bpf: avoid retpoline for lookup/update/delete calls on maps · 09772d92
      Daniel Borkmann authored
      While some of the BPF map lookup helpers provide a ->map_gen_lookup()
      callback for inlining the map lookup altogether, it is not available
      for every map, so the remaining ones have to call the
      bpf_map_lookup_elem() helper, which does a dispatch to
      map->ops->map_lookup_elem(). In times of retpolines, this indirect
      call gets trapped and its speculation controlled instead of being
      left to do its work, and therefore causes a slowdown. Likewise,
      bpf_map_update_elem() and bpf_map_delete_elem() do not have an
      inlined version and need to call into their map->ops->map_update_elem()
      and map->ops->map_delete_elem() handlers, respectively.
      
      Before:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#232656
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
         16: (95) exit                                 helper
      
      After:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#233328
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
         16: (95) exit
      
      In all three lookup/update/delete cases, however, we can use the
      actual address of the map callback directly if we find that there is
      only a single path with a map pointer leading to the helper call,
      meaning the map pointer has not been poisoned from the verifier side.
      Example code can be seen above for the delete case.
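
      As a rough sketch (not the exact verifier code; map_ptr and the
      single-path check stand in for the verifier's own bookkeeping), the
      fixup boils down to rewriting the call instruction's immediate so it
      targets the map's own callback instead of the generic helper:

        /* Sketch: BPF call targets are encoded relative to __bpf_call_base,
         * so patching insn->imm with the callback's offset turns the
         * retpolined indirect call into a direct call.
         */
        if (insn->imm == BPF_FUNC_map_delete_elem && single_known_map_ptr) {
                u64 addr = (u64)(unsigned long)map_ptr->ops->map_delete_elem;

                insn->imm = addr - (u64)(unsigned long)__bpf_call_base;
                continue;
        }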
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      09772d92
  3. 14 Jan, 2018 3 commits
  4. 12 Dec, 2017 1 commit
  5. 20 Oct, 2017 1 commit
  6. 19 Oct, 2017 1 commit
  7. 01 Sep, 2017 2 commits
    • Martin KaFai Lau
      bpf: Only set node->ref = 1 if it has not been set · bb9b9f88
      Martin KaFai Lau authored
      This patch writes 'node->ref = 1' only if node->ref is 0.
      The number of lookups/s for a ~1M-entry LRU map increased by
      ~30% (260097 to 343313).
      
      The writes of 'node->ref = 0' are not changed.  In those cases, the
      same cache line has to be changed anyway.
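
      The change essentially guards the store, along these lines (a sketch
      of the pattern used by the LRU code):

        static inline void bpf_lru_node_set_ref(struct bpf_lru_node *node)
        {
                /* Avoid dirtying the cache line on every lookup hit: only
                 * write when the ref bit actually flips from 0 to 1.
                 */
                if (!node->ref)
                        node->ref = 1;
        }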
      
      First column: Size of the LRU hash
      Second column: Number of lookups/s
      
      Before:
      > echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
      1048577: 260097
      
      After:
      > echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
      1048577: 343313
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb9b9f88
    • Martin KaFai Lau
      bpf: Inline LRU map lookup · cc555421
      Martin KaFai Lau authored
      Inline the LRU map lookup to save the cost of the calls to
      bpf_map_lookup_elem() and htab_lru_map_lookup_elem().
      
      Different LRU hash sizes are tested.  The benefit diminishes when
      cache misses start to dominate for the bigger LRU hashes.
      Considering the change is simple, it is still worth optimizing.
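
      The inlining is done via the map's ->map_gen_lookup() callback, which
      emits a short BPF instruction sequence in place of the generic helper
      call; a simplified sketch (LRU ref handling omitted, so this is
      illustrative rather than the exact patch):

        static u32 htab_lru_map_gen_lookup(struct bpf_map *map,
                                           struct bpf_insn *insn_buf)
        {
                struct bpf_insn *insn = insn_buf;
                const int ret = BPF_REG_0;

                /* Call the internal lookup directly and, on a hit, advance
                 * R0 from the htab_elem to its value area.
                 */
                *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))
                                        __htab_map_lookup_elem);
                *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
                *insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
                                        offsetof(struct htab_elem, key) +
                                        round_up(map->key_size, 8));
                return insn - insn_buf;
        }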
      
      First column: Size of the LRU hash
      Second column: Number of lookups/s
      
      Before:
      > for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
      513: 1132020
      1025: 1056826
      2049: 1007024
      4097: 853298
      8193: 742723
      16385: 712600
      32769: 688142
      65537: 677028
      131073: 619437
      262145: 498770
      524289: 316695
      1048577: 260038
      
      After:
      > for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
      513: 1221851
      1025: 1144695
      2049: 1049902
      4097: 884460
      8193: 773731
      16385: 729673
      32769: 721989
      65537: 715530
      131073: 671665
      262145: 516987
      524289: 321125
      1048577: 260048
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc555421
  8. 22 Aug, 2017 2 commits
    • Daniel Borkmann
      bpf: fix map value attribute for hash of maps · 33ba43ed
      Daniel Borkmann authored
      Currently, iproute2's BPF ELF loader works fine with array of maps
      when retrieving the fd from a pinned node and doing a selfcheck
      against the provided map attributes from the object file, but we
      fail to do the same for hash of maps and thus refuse to get the
      map from the pinned node.
      
      The reason is that when allocating a hash of maps, fd_htab_map_alloc()
      will set the value size to sizeof(void *), while any user space map
      creation request is forced to set 4 bytes as the value size. Thus, the
      selfcheck will complain about an exposed 8 bytes on 64 bit archs vs.
      the 4 bytes from the object file as value size. The contract is that
      fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create
      the map.
      
      Fix it by handling it the same way as we do for array of maps: we
      leave the value size at 4 bytes and round it up to 8 bytes in the
      allocation phase. alloc_htab_elem() needs an adjustment in order to
      copy the rounded up 8 bytes, since bpf_fd_htab_map_update_elem()
      calls into htab_map_update_elem() with a pointer to the map pointer
      as value. Unlike array of maps, where we just xchg(), we're using
      the generic htab_map_update_elem() callback that is also used from
      helper calls and which publishes the key/value already on return,
      so we need to ensure we memcpy() the right size.
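
      The gist of the fix, as a sketch (is_fd_htab() is an assumed helper
      name and the placement is simplified):

        /* Sketch: keep map->value_size at 4 bytes for the fdinfo/selfcheck
         * contract, but copy the full 8-byte pointer slot when writing the
         * element on 64 bit.
         */
        u32 size = htab->map.value_size;

        if (is_fd_htab(htab))           /* hash of maps stores a pointer */
                size = round_up(size, 8);

        memcpy(l_new->key + round_up(key_size, 8), value, size);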
      
      Fixes: bcc6b1b7 ("bpf: Add hash of maps support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33ba43ed
    • Daniel Borkmann
      bpf: fix map value attribute for hash of maps · cd36c3a2
      Daniel Borkmann authored
      Currently, iproute2's BPF ELF loader works fine with array of maps
      when retrieving the fd from a pinned node and doing a selfcheck
      against the provided map attributes from the object file, but we
      fail to do the same for hash of maps and thus refuse to get the
      map from the pinned node.
      
      The reason is that when allocating a hash of maps, fd_htab_map_alloc()
      will set the value size to sizeof(void *), while any user space map
      creation request is forced to set 4 bytes as the value size. Thus, the
      selfcheck will complain about an exposed 8 bytes on 64 bit archs vs.
      the 4 bytes from the object file as value size. The contract is that
      fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create
      the map.
      
      Fix it by handling it the same way as we do for array of maps: we
      leave the value size at 4 bytes and round it up to 8 bytes in the
      allocation phase. alloc_htab_elem() needs an adjustment in order to
      copy the rounded up 8 bytes, since bpf_fd_htab_map_update_elem()
      calls into htab_map_update_elem() with a pointer to the map pointer
      as value. Unlike array of maps, where we just xchg(), we're using
      the generic htab_map_update_elem() callback that is also used from
      helper calls and which publishes the key/value already on return,
      so we need to ensure we memcpy() the right size.
      
      Fixes: bcc6b1b7 ("bpf: Add hash of maps support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd36c3a2
  9. 20 Aug, 2017 2 commits
    • Daniel Borkmann
      bpf: inline map in map lookup functions for array and htab · 7b0c2a05
      Daniel Borkmann authored
      Avoid two successive function calls for the map in map lookup: the
      first is the bpf_map_lookup_elem() helper call, the second the
      callback via map->ops->map_lookup_elem() to get to the map in map
      implementation. The implementation inlines the array and htab
      flavors of map in map lookups.
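
      Conceptually, the map-in-map flavor reuses the inlined lookup and
      appends one extra load that turns the stored pointer into the inner
      map; a rough sketch of the emitted tail (not the exact instruction
      sequence):

        /* Sketch: after the inlined array/htab lookup leaves the element
         * address in R0, one extra load fetches the inner map pointer,
         * replacing the second indirect callback.
         */
        *insn++ = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1);        /* miss? */
        *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0); /* inner */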
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b0c2a05
    • Martin KaFai Lau
      bpf: Allow selecting numa node during map creation · 96eabe7a
      Martin KaFai Lau authored
      The current map creation API does not allow providing a numa-node
      preference.  The memory usually comes from the node where the
      map-creating process is running.  The performance is not ideal if
      the bpf_prog is known to always run on a numa node different from
      that of the map-creating process.
      
      One of the use cases is sharding per CPU into different LRU maps
      (i.e. an array of LRU maps).  Here is the test result of
      map_perf_test on the INNER_LRU_HASH_PREALLOC test if we force the
      LRU map used by CPU0 to be allocated from a remote numa node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to bpf_attr.  Since numa node 0
      is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The
      numa_node field is honored if and only if the BPF_F_NUMA_NODE flag
      is set.
      
      Numa node selection is not supported for percpu maps.
      
      This patch does not change all the kmalloc calls.  E.g.
      'htab = kzalloc()' is not changed since the object
      is small enough to stay in the cache.
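
      From user space, creating a map on a specific node then looks roughly
      like this sketch (raw bpf(2) syscall, error handling trimmed):

        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Sketch: numa_node is only honored because BPF_F_NUMA_NODE is set. */
        static int create_map_on_node1(void)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.map_type    = BPF_MAP_TYPE_HASH;
                attr.key_size    = 4;
                attr.value_size  = 8;
                attr.max_entries = 1024;
                attr.map_flags   = BPF_F_NUMA_NODE;
                attr.numa_node   = 1;

                return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        }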
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96eabe7a
  10. 29 Jun, 2017 1 commit
  11. 25 Apr, 2017 1 commit
  12. 11 Apr, 2017 1 commit
  13. 22 Mar, 2017 2 commits
    • Martin KaFai Lau
      bpf: Add hash of maps support · bcc6b1b7
      Martin KaFai Lau authored
      This patch adds hash of maps support (hashmap->bpf_map).
      BPF_MAP_TYPE_HASH_OF_MAPS is added.
      
      A map-in-map contains a pointer to another map; let's call
      this pointer 'inner_map_ptr'.
      
      Notes on deleting inner_map_ptr from a hash map:
      
      1. For BPF_F_NO_PREALLOC map-in-map, when deleting
         an inner_map_ptr, the htab_elem itself will go through
         a rcu grace period and the inner_map_ptr resides
         in the htab_elem.
      
      2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
         when deleting an inner_map_ptr, the htab_elem may
         get reused immediately.  This situation is similar
         to the existing pre-allocated use cases.
      
         However, bpf_map_fd_put_ptr() calls bpf_map_put(), which calls
         inner_map->ops->map_free(inner_map), which will go
         through an rcu grace period (i.e. all bpf_map map_free
         callbacks currently go through an rcu grace period).  Hence,
         the inner_map_ptr is still safe for the rcu reader side.
      
      This patch also adds BPF_MAP_TYPE_HASH_OF_MAPS to
      check_map_prealloc() in the verifier.  Preallocation is a
      must for BPF_PROG_TYPE_PERF_EVENT.  Hence, even though we don't
      expect heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for
      map-in-map is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT
      from using map-in-map first.
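
      From the BPF program side, usage looks like this sketch (map
      definitions and section annotations omitted; outer_hash_of_maps is a
      placeholder for a BPF_MAP_TYPE_HASH_OF_MAPS map):

        u32 outer_key = 0, inner_key = 0;
        void *inner_map;
        long *value;

        /* The outer lookup yields the inner map stored under outer_key;
         * it can then be used like any ordinary map.
         */
        inner_map = bpf_map_lookup_elem(&outer_hash_of_maps, &outer_key);
        if (inner_map) {
                value = bpf_map_lookup_elem(inner_map, &inner_key);
                if (value)
                        __sync_fetch_and_add(value, 1);
        }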
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bcc6b1b7
    • Alexei Starovoitov
      bpf: fix hashmap extra_elems logic · 8c290e60
      Alexei Starovoitov authored
      In both kmalloc and prealloc mode, bpf_map_update_elem() uses
      per-cpu extra_elems to do an atomic update when the map is full.
      There are two issues with it. The logic can be misused, since it allows
      max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
      at map creation time can fail the percpu allocation for large map values with a warning:
      WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
      illegal size (32824) or align (8) for percpu allocation
      
      The fixes for these two issues are different for kmalloc and prealloc modes.
      For prealloc mode, allocate extra num_possible_cpus elements and store
      their pointers into the extra_elems array instead of the actual elements.
      Hence we can use these hidden (spare) elements not only when the map is full
      but also during a bpf_map_update_elem() that replaces an existing element.
      That also improves performance, since pcpu_freelist_pop/push is avoided.
      Unfortunately this approach cannot be used for kmalloc mode, which needs
      to kfree elements after an rcu grace period. Therefore switch it back to normal
      kmalloc even when the map is full and an old element exists, like it was prior to
      commit 6c905981 ("bpf: pre-allocate hash map elements").
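
      For the prealloc path, the spare-element swap amounts to roughly the
      following sketch (simplified from the hash table update path):

        /* Sketch: when replacing an existing element in prealloc mode, take
         * the per-cpu spare element and park the old element as the new
         * spare, avoiding pcpu_freelist_pop/push entirely.
         */
        struct htab_elem *l_new, **pl_new;

        if (old_elem) {
                pl_new = this_cpu_ptr(htab->extra_elems);
                l_new = *pl_new;
                *pl_new = old_elem;
        } else {
                struct pcpu_freelist_node *l;

                l = pcpu_freelist_pop(&htab->freelist);
                if (!l)
                        return ERR_PTR(-E2BIG); /* map is genuinely full */
                l_new = container_of(l, struct htab_elem, fnode);
        }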
      
      Add tests to check for over max_entries and large map values.
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c290e60
  14. 17 Mar, 2017 1 commit
  15. 09 Mar, 2017 2 commits
  16. 17 Feb, 2017 1 commit
  17. 18 Jan, 2017 1 commit
    • Daniel Borkmann
      bpf: don't trigger OOM killer under pressure with map alloc · d407bd25
      Daniel Borkmann authored
      This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
      that are to be used for map allocations. Using kmalloc() for very large
      allocations can cause excessive work within the page allocator, so i) fall
      back earlier to vmalloc() when the attempt is considered costly anyway,
      and even more importantly ii) don't trigger the OOM killer with any of
      the allocators.
      
      Since this is based on a user space request, for example, when creating
      maps with element pre-allocation, we really want such requests to fail
      instead of killing other user space processes.
      
      Also, don't spam the kernel log with warnings should any of the allocations
      fail under pressure. Given that, we can make backend selection in
      bpf_map_area_alloc() generic, and convert all maps over to use this API
      for spots with potentially large allocation requests.
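
      The helper is essentially of this shape (a sketch; the exact gfp flags
      and vmalloc variant may differ):

        /* Sketch: prefer kmalloc() for small, cheap requests, but never
         * warn, never retry into the OOM killer, and fall back to vmalloc()
         * for anything the page allocator would consider costly.
         */
        void *bpf_map_area_alloc(size_t size)
        {
                const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
                void *area;

                if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                        area = kmalloc(size, GFP_USER | flags);
                        if (area)
                                return area;
                }
                return __vmalloc(size, GFP_KERNEL | flags | __GFP_HIGHMEM,
                                 PAGE_KERNEL);
        }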
      
      Note, replacing the one kmalloc_array() is fine, as overflow checks
      happen earlier in htab_map_alloc(), which must also protect the
      multiplication for vmalloc() should kmalloc_array() fail.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d407bd25
  18. 11 Jan, 2017 1 commit
  19. 15 Nov, 2016 3 commits
  20. 07 Nov, 2016 1 commit
  21. 07 Aug, 2016 1 commit
    • Alexei Starovoitov
      bpf: restore behavior of bpf_map_update_elem · a6ed3ea6
      Alexei Starovoitov authored
      The introduction of pre-allocated hash elements inadvertently broke
      the behavior of bpf hash maps where users expected to call
      bpf_map_update_elem() without considering that the map can be full.
      Some programs do:
      old_value = bpf_map_lookup_elem(map, key);
      if (old_value) {
        ... prepare new_value on stack ...
        bpf_map_update_elem(map, key, new_value);
      }
      Before pre-alloc, the update() of an existing element would work even
      in the 'map full' condition. Restore this behavior.
      
      The above program could have updated old_value in place instead of
      calling update(), which would be faster, and most programs use that
      approach, but sometimes the values are large and the programs use the
      update() helper to do an atomic replacement of the element.
      Note we cannot simply update the element's value in place like the
      percpu hash map does; we have to allocate extra num_possible_cpus
      elements and use this extra reserve when the map is full.
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6ed3ea6
  22. 08 Mar, 2016 1 commit
    • Alexei Starovoitov
      bpf: pre-allocate hash map elements · 6c905981
      Alexei Starovoitov authored
      If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
      bpf programs is not safe, since the following deadlock is possible:
      kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
      bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
      
      The following solutions were considered and some implemented, but
      eventually discarded
      - kmem_cache_create for every map
      - add recursion check to slow-path of slub
      - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
      - kmalloc via irq_work
      
      In the end, pre-allocation of all map elements turned out to be the simplest
      solution, and since the user is charged upfront for all the memory, such
      pre-allocation doesn't affect the user-space-visible behavior.
      
      Since it's impossible to tell whether a kprobe is triggered in a
      location that is safe from the kmalloc point of view, use pre-allocation
      by default and introduce a new BPF_F_NO_PREALLOC flag.
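
      Opting back into the old kmalloc behavior is then a matter of passing
      the new flag at map creation time, e.g. (sketch, raw bpf(2) attribute;
      headers omitted):

        union bpf_attr attr = {
                .map_type    = BPF_MAP_TYPE_HASH,
                .key_size    = 4,
                .value_size  = 8,
                .max_entries = 1 << 20,
                .map_flags   = BPF_F_NO_PREALLOC, /* skip pre-allocation */
        };

        int fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));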
      
      While testing per-cpu hash maps it was discovered
      that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
      fails to allocate memory even when 90% of it is free.
      The pre-allocation of per-cpu hash elements solves this problem as well.
      
      It turned out that bpf_map_update() quickly followed by
      bpf_map_lookup()+bpf_map_delete() is a very common pattern used
      in many of the iovisor/bcc tools, so there is an additional benefit to
      pre-allocation, since such use cases are much faster.
      
      Since all hash map elements are now pre-allocated we can remove the
      atomic increment of htab->count and save a few more cycles.
      
      Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
      large malloc/free done by users who don't have sufficient limits.
      
      Pre-allocation is done with vmalloc and alloc/free is done
      via percpu_freelist. Here are performance numbers for different
      pre-allocation algorithms that were implemented, but discarded
      in favor of percpu_freelist:
      
      1 cpu:
      pcpu_ida	2.1M
      pcpu_ida nolock	2.3M
      bt		2.4M
      kmalloc		1.8M
      hlist+spinlock	2.3M
      pcpu_freelist	2.6M
      
      4 cpu:
      pcpu_ida	1.5M
      pcpu_ida nolock	1.8M
      bt w/smp_align	1.7M
      bt no/smp_align	1.1M
      kmalloc		0.7M
      hlist+spinlock	0.2M
      pcpu_freelist	2.0M
      
      8 cpu:
      pcpu_ida	0.7M
      bt w/smp_align	0.8M
      kmalloc		0.4M
      pcpu_freelist	1.5M
      
      32 cpu:
      kmalloc		0.13M
      pcpu_freelist	0.49M
      
      pcpu_ida nolock is a modified percpu_ida algorithm without
      percpu_ida_cpu locks and without cross-cpu tag stealing.
      It's faster than the existing percpu_ida, but not as fast as pcpu_freelist.
      
      bt is a variant of block/blk-mq-tag.c simplified and customized
      for the bpf use case. bt w/smp_align uses a cache line for every 'long'
      (similar to blk-mq-tag). bt no/smp_align allocates 'long'
      bitmasks contiguously to save memory. It's comparable to percpu_ida
      and in some cases faster, but slower than percpu_freelist.
      
      hlist+spinlock is the simplest free list with a single spinlock.
      As expected, it has very bad scaling in SMP.
      
      kmalloc is the existing implementation, which is still available via the
      BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and
      in the 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
      but it saves memory, so in cases where map->max_entries can be large
      and the number of map updates/deletes per second is low, it may make
      sense to use it.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c905981
  23. 19 Feb, 2016 1 commit
  24. 06 Feb, 2016 2 commits
    • Alexei Starovoitov
      bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33
      Alexei Starovoitov authored
      The functions bpf_map_lookup_elem(map, key, value) and
      bpf_map_update_elem(map, key, value, flags) need to get/set
      values from all CPUs for per-cpu hash and array maps,
      so that user space can aggregate/update them as necessary.
      
      Example of single counter aggregation in user space:
        unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
        long values[nr_cpus];
        long value = 0;
      
        bpf_lookup_elem(fd, key, values);
        for (i = 0; i < nr_cpus; i++)
          value += values[i];
      
      User space must provide an array of round_up(value_size, 8) * nr_cpus
      bytes to get/set the values, since the kernel will use 'long' copies
      of the per-cpu values to try to copy good counters atomically.
      It's best-effort, since bpf programs and user space are racing
      to access the same memory.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15a07b33
    • Alexei Starovoitov
      bpf: introduce BPF_MAP_TYPE_PERCPU_HASH map · 824bd0ce
      Alexei Starovoitov authored
      Introduce the BPF_MAP_TYPE_PERCPU_HASH map type, which is used to do
      accurate counters without the need to use the BPF_XADD instruction,
      which turned out to be too costly for high-performance network monitoring.
      In the typical use case the 'key' is a flow tuple or other long-living
      object that sees a lot of events per second.
      
      bpf_map_lookup_elem() returns the per-cpu area.
      Example:
      struct {
        u32 packets;
        u32 bytes;
      } *ptr = bpf_map_lookup_elem(&map, &key);
      /* ptr points to the this_cpu area of the value, so the following
       * increments will not collide with other cpus
       */
      ptr->packets++;
      ptr->bytes += skb->len;
      
      bpf_update_elem() atomically creates a new element where all per-cpu
      values are zero-initialized and the this_cpu value is populated with
      the given 'value'.
      Note that a non-per-cpu hash map always allocates a new element
      and then deletes the old one after an rcu grace period to maintain
      atomicity of the update.  The per-cpu hash map updates element values in place.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      824bd0ce
  25. 29 Dec, 2015 3 commits
  26. 03 Dec, 2015 1 commit
    • Alexei Starovoitov
      bpf: fix allocation warnings in bpf maps and integer overflow · 01b3f521
      Alexei Starovoitov authored
      For a large map->value_size, user space can trigger memory allocation warnings like:
      WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
      __alloc_pages_nodemask+0x695/0x14e0()
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
       [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
       [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
       [<     inline     >] __alloc_pages_slowpath mm/page_alloc.c:2989
       [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
       [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
       [<     inline     >] alloc_pages include/linux/gfp.h:451
       [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
       [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
       [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
       [<     inline     >] kmalloc_large include/linux/slab.h:390
       [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
       [<     inline     >] kmalloc include/linux/slab.h:463
       [<     inline     >] map_update_elem kernel/bpf/syscall.c:288
       [<     inline     >] SYSC_bpf kernel/bpf/syscall.c:744
      
      To avoid a kmalloc that can never succeed (order >= MAX_ORDER), check that
      elem->value_size and the computed elem_size are within limits for both hash and
      array type maps.
      Also add __GFP_NOWARN to the kmalloc(value_size | elem_size) calls to avoid OOM warnings.
      Note that kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
      so keep those kmalloc-s as-is.
      
      Large value_size can cause integer overflows in elem_size and map.pages
      formulas, so check for that as well.
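
      The added checks are essentially of this shape (a sketch; constants as
      used for the hash table case):

        /* Sketch: reject value sizes that could never be kmalloc'ed and
         * that would overflow the elem_size/map.pages computations.
         */
        if (attr->value_size >= KMALLOC_MAX_SIZE -
            MAX_BPF_STACK - sizeof(struct htab_elem))
                return ERR_PTR(-E2BIG);

        /* ... and keep the large copies quiet under memory pressure: */
        value = kmalloc(map->value_size, GFP_USER | __GFP_NOWARN);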
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01b3f521
  27. 02 Nov, 2015 1 commit
    • Yang Shi
      bpf: convert hashtab lock to raw lock · ac00881f
      Yang Shi authored
      When running bpf samples on an rt kernel, the below warning is reported:
      
      BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
      in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
      Preemption disabled at:[<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      
      CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
      Hardware name: Freescale Layerscape 2085a RDB Board (DT)
      Call trace:
      [<ffff80000008a5b0>] dump_backtrace+0x0/0x128
      [<ffff80000008a6f8>] show_stack+0x20/0x30
      [<ffff8000007da90c>] dump_stack+0x7c/0xa0
      [<ffff8000000e4830>] ___might_sleep+0x188/0x1a0
      [<ffff8000007e2200>] rt_spin_lock+0x28/0x40
      [<ffff80000018bf9c>] htab_map_update_elem+0x124/0x320
      [<ffff80000018c718>] bpf_map_update_elem+0x40/0x58
      [<ffff800000187658>] __bpf_prog_run+0xd48/0x1640
      [<ffff80000017ca6c>] trace_call_bpf+0x8c/0x100
      [<ffff80000017db58>] kprobe_perf_func+0x30/0x228
      [<ffff80000017dd84>] kprobe_dispatcher+0x34/0x58
      [<ffff8000007e399c>] kprobe_handler+0x114/0x250
      [<ffff8000007e3bf4>] kprobe_breakpoint_handler+0x1c/0x30
      [<ffff800000085b80>] brk_handler+0x88/0x98
      [<ffff8000000822f0>] do_debug_exception+0x50/0xb8
      Exception stack(0xffff808349687460 to 0xffff808349687580)
      7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
      7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
      74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
      74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
      74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
      7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
      7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
      7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
      7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77
      
      Convert the hashtab lock to a raw lock to avoid such warnings.
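
      The conversion boils down to the following pattern (a sketch; the
      lock's placement in the hash table structures is simplified):

        raw_spinlock_t lock;                       /* was: spinlock_t lock; */

        unsigned long flags;

        /* A raw_spinlock_t is never turned into a sleeping rt_mutex on
         * PREEMPT_RT, so taking it from kprobe/BPF context with preemption
         * disabled is legal.
         */
        raw_spin_lock_irqsave(&lock, flags);       /* was: spin_lock_irqsave()   */
        /* ... update the hash table ... */
        raw_spin_unlock_irqrestore(&lock, flags);  /* was: spin_unlock_irqrestore() */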
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac00881f
  28. 13 Oct, 2015 1 commit
    • Alexei Starovoitov
      bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Alexei Starovoitov authored
      Since eBPF programs and maps use kernel memory, consider it 'locked' memory
      from the user accounting point of view and charge it against the RLIMIT_MEMLOCK
      limit. This limit is typically set to 64 Kbytes by distros, so almost all
      bpf+tracing programs would need to increase it, since they use maps,
      but the kernel charges the maximum map size upfront.
      For example, a hash map of 1024 elements will be charged as 64 Kbytes.
      It's inconvenient for current users and changes current behavior for root,
      but it is probably worth doing to be consistent between root and non-root.
      
      Similar accounting logic is done by mmap of perf_event.
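
      In practice this means bpf+tracing users typically bump the limit
      before creating maps, e.g. (user space sketch):

        #include <stdio.h>
        #include <sys/resource.h>

        /* Sketch: raise RLIMIT_MEMLOCK before loading programs/maps, since
         * the default 64K limit is charged upfront with the maximum map size.
         */
        struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

        if (setrlimit(RLIMIT_MEMLOCK, &r))
                perror("setrlimit(RLIMIT_MEMLOCK)");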
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aaac3ba9