Skip to content
  • Clement Courbet's avatar
    lib: optimize cpumask_next_and() · 0ade34c3
    Clement Courbet authored
    We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
    It's essentially a joined iteration in search for a non-zero bit, which is
    currently implemented as a lookup join (find a nonzero bit on the lhs,
    lookup the rhs to see if it's set there).
    
    Implement a direct join (find a nonzero bit on the incrementally built
    join).  Also add generic bitmap benchmarks in the new `test_find_bit`
    module for new function (see `find_next_and_bit` in [2] and [3] below).
    
    For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
    faster with a geometric mean of 2.1 on 32 CPUs [1].  No impact on memory
    usage.  Note that on Arm, the new pure-C implementation still outperforms
    the old one that uses a mix of C and asm (`find_next_bit`) [3].
    
    [1] Approximate benchmark code:
    
    ```
      unsigned long src1p[nr_cpumask_longs] = {pattern1};
      unsigned long src2p[nr_cpumask_longs] = {pattern2};
      for (/*a bunch of repetitions*/) {
        for (int n = -1; n <= nr_cpu_ids; ++n) {
    ...
    0ade34c3