linux_dsm_epyc7002/arch/unicore32/include/asm
Clement Courbet 0ade34c370 lib: optimize cpumask_next_and()
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search for a non-zero bit, which is
currently implemented as a lookup join (find a nonzero bit on the lhs,
lookup the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join).  Also add generic bitmap benchmarks in the new `test_find_bit`
module for the new function (see `find_next_and_bit` in [2] and [3] below).
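
To make the two strategies concrete, here is a minimal user-space sketch. It is not the kernel implementation: the names `next_bit`, `lookup_join`, `direct_join` and `check_joins_agree` are hypothetical, `start` is an inclusive bit index (unlike `cpumask_next_and`, which searches after `n`), and `__builtin_ctzl` assumes GCC or Clang.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: first set bit in map at index >= start, or nbits. */
static unsigned long next_bit(const unsigned long *map, unsigned long nbits,
                              unsigned long start)
{
    for (unsigned long i = start; i < nbits; i++)
        if (map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return nbits;
}

/* Lookup join: find a set bit in a, then test that same bit in b. */
static unsigned long lookup_join(const unsigned long *a, const unsigned long *b,
                                 unsigned long nbits, unsigned long start)
{
    unsigned long i = next_bit(a, nbits, start);

    while (i < nbits && !(b[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))))
        i = next_bit(a, nbits, i + 1);
    return i;
}

/* Direct join: AND one word of a and b at a time, then scan the result. */
static unsigned long direct_join(const unsigned long *a, const unsigned long *b,
                                 unsigned long nbits, unsigned long start)
{
    unsigned long nwords = (nbits + BITS_PER_LONG - 1) / BITS_PER_LONG;
    unsigned long word, tmp, bit;

    if (start >= nbits)
        return nbits;

    word = start / BITS_PER_LONG;
    /* Mask off the bits below start in the first word. */
    tmp = a[word] & b[word] & (~0UL << (start % BITS_PER_LONG));
    while (!tmp) {
        if (++word >= nwords)
            return nbits;
        tmp = a[word] & b[word];
    }
    /* Assumes GCC/Clang: count trailing zeros to locate the lowest set bit. */
    bit = word * BITS_PER_LONG + (unsigned long)__builtin_ctzl(tmp);
    return bit < nbits ? bit : nbits;
}

/* Sanity check: both strategies must agree for every start position. */
static void check_joins_agree(const unsigned long *a, const unsigned long *b,
                              unsigned long nbits)
{
    for (unsigned long s = 0; s <= nbits; s++)
        assert(lookup_join(a, b, nbits, s) == direct_join(a, b, nbits, s));
}
```

The lookup join restarts a search on the lhs for every lhs bit that is missing from the rhs, whereas the direct join does one AND per word. That matches the pattern in the results, where the speedup is largest when the rhs is all zeroes.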

For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
faster with a geometric mean of 2.1 on 32 CPUs [1].  No impact on memory
usage.  Note that on Arm, the new pure-C implementation still outperforms
the old one that uses a mix of C and asm (`find_next_bit`) [3].

[1] Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```

Results:
pattern1    pattern2     time_before/time_after
0x0000ffff  0x0000ffff   1.65
0x0000ffff  0x00005555   2.24
0x0000ffff  0x00001111   2.94
0x0000ffff  0x00000000   14.0
0x00005555  0x0000ffff   1.67
0x00005555  0x00005555   1.71
0x00005555  0x00001111   1.90
0x00005555  0x00000000   6.58
0x00001111  0x0000ffff   1.46
0x00001111  0x00005555   1.49
0x00001111  0x00001111   1.45
0x00001111  0x00000000   3.10
0x00000000  0x0000ffff   1.18
0x00000000  0x00005555   1.18
0x00000000  0x00001111   1.17
0x00000000  0x00000000   1.25
-----------------------------
               geo.mean  2.06

[2] test_find_next_bit, X86 (skylake)

 [ 3913.477422] Start testing find_bit() with random-filled bitmap
 [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
 [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
 [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
 [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
 [ 3913.480216] Start testing find_next_and_bit() with random-filled bitmap
 [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
 [ 3913.481075] Start testing find_bit() with sparse bitmap
 [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
 [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
 [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
 [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
 [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
 [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations

[3] test_find_next_bit, arm (v7 odroid XU3).

[  267.206928] Start testing find_bit() with random-filled bitmap
[  267.214752] find_next_bit: 4474 cycles, 16419 iterations
[  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
[  267.229294] find_last_bit: 4209 cycles, 16419 iterations
[  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
[  267.286265] Start testing find_next_and_bit() with random-filled bitmap
[  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
[  267.309422] Start testing find_bit() with sparse bitmap
[  267.316054] find_next_bit: 191 cycles, 66 iterations
[  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
[  267.329803] find_last_bit: 84 cycles, 66 iterations
[  267.336169] find_first_bit: 4118 cycles, 66 iterations
[  267.342627] Start testing find_next_and_bit() with sparse bitmap
[  267.356919] find_next_and_bit: 91 cycles, 1 iterations

[courbet@google.com: v6]
  Link: http://lkml.kernel.org/r/20171129095715.23430-1-courbet@google.com
[geert@linux-m68k.org: m68k/bitops: always include <asm-generic/bitops/find.h>]
  Link: http://lkml.kernel.org/r/1512556816-28627-1-git-send-email-geert@linux-m68k.org
Link: http://lkml.kernel.org/r/20171128131334.23491-1-courbet@google.com
Signed-off-by: Clement Courbet <courbet@google.com>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Yury Norov <ynorov@caviumnetworks.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06 18:32:44 -08:00
assembler.h unicore32 additional architecture files: low-level lib: misc 2011-03-17 09:19:13 +08:00
barrier.h arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h 2014-01-12 10:37:15 +01:00
bitops.h lib: optimize cpumask_next_and() 2018-02-06 18:32:44 -08:00
bug.h UniCore32-bugfix: Remove definitions in asm/bug.h to solve difference between native and cross compiler 2012-11-09 17:30:09 +08:00
cache.h unicore32 core architecture: mm related: generic codes 2011-03-17 09:19:08 +08:00
cacheflush.h unicore32: make dma_cache_sync a no-op 2017-10-19 16:37:36 +02:00
checksum.h ipv4: Update parameters for csum_tcpudp_magic to their original types 2016-03-13 23:55:13 -04:00
cmpxchg.h UniCore32-bugfix: fix mismatch return value of __xchg_bad_pointer 2012-11-09 17:30:09 +08:00
cpu-single.h unicore32 core architecture: processor and system headers 2011-03-17 09:19:06 +08:00
cputype.h unicore32 core architecture: processor and system headers 2011-03-17 09:19:06 +08:00
delay.h unicore32 additional architecture files: low-level lib: misc 2011-03-17 09:19:13 +08:00
dma-mapping.h unicore32: use generic swiotlb_ops 2018-01-15 09:35:55 +01:00
dma.h unicore32 core architecture: mm related: consistent device DMA handling 2011-03-17 09:19:09 +08:00
elf.h
fpstate.h unicore32 additional architecture files: float point handling 2011-03-17 09:19:11 +08:00
fpu-ucf64.h unicore32 additional architecture files: float point handling 2011-03-17 09:19:11 +08:00
gpio.h unicore32 io: redefine __REG(x) and re-use readl/writel funcs 2011-03-17 09:19:19 +08:00
hwcap.h unicore32 core architecture: processor and system headers 2011-03-17 09:19:06 +08:00
hwdef-copro.h Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt] 2012-03-28 18:30:03 +01:00
io.h arch:unicore32:mm: add devmem_is_allowed() to support STRICT_DEVMEM 2014-06-20 08:22:40 +08:00
irq.h unicore32: remove unused lines in arch/unicore32/include/asm/irq.h 2011-03-17 09:19:17 +08:00
irqflags.h unicore32 core architecture: interrupts ang gpio handling 2011-03-17 09:19:10 +08:00
Kbuild arch: Remove clkdev.h asm-generic from Kbuild 2018-01-03 09:02:11 -08:00
linkage.h
memblock.h unicore32 core architecture: mm related: generic codes 2011-03-17 09:19:08 +08:00
memory.h mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h 2015-08-27 19:40:58 -04:00
mmu_context.h arch, mm: Allow arch_dup_mmap() to fail 2017-12-22 20:13:01 +01:00
mmu.h unicore32 core architecture: mm related: fault handling 2011-03-17 09:19:09 +08:00
page.h unicore32 core architecture: mm related: generic codes 2011-03-17 09:19:08 +08:00
pci.h unicore32/PCI: Use generic pci_mmap_resource_range() 2017-04-20 08:47:47 -05:00
pgalloc.h kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK 2017-11-15 18:21:04 -08:00
pgtable-hwdef.h unicore32: drop pte_file()-related helpers 2015-02-10 14:30:33 -08:00
pgtable.h arch, mm: convert all architectures to use 5level-fixup.h 2017-03-09 11:48:47 -08:00
processor.h locking/core: Provide common cpu_relax_yield() definition 2016-11-17 08:17:36 +01:00
ptrace.h arch/unicore32/include/asm/ptrace.h: add generic definition for profile_pc() 2014-06-20 08:22:38 +08:00
stacktrace.h unicore32 core architecture: process/thread related codes 2011-03-17 09:19:07 +08:00
string.h
suspend.h PM / Hibernate: Remove arch_prepare_suspend() 2011-05-24 23:35:55 +02:00
switch_to.h Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt] 2012-03-28 18:30:03 +01:00
thread_info.h Construct init thread stack in the linker script rather than by union 2018-01-09 23:21:02 +00:00
timex.h unicore32 core architecture: timer and time handling 2011-03-17 09:19:10 +08:00
tlb.h unicore32: rewrite arch-specific tlb.h to use asm-generic version 2011-03-17 09:19:21 +08:00
tlbflush.h unicore32 core architecture: mm related: consistent device DMA handling 2011-03-17 09:19:09 +08:00
traps.h unicore32 core architecture: low level entry and setup codes 2011-03-17 09:19:06 +08:00
uaccess.h unicore32: get rid of zeroing and switch to RAW_COPY_USER 2017-03-28 18:24:04 -04:00