Christoph Lameter dfb4f09609 SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab
A remote free may access the same page struct that also contains the lockless
freelist for the cpu slab. If objects have a short lifetime and are freed by
a different processor then remote frees back to the slab from which we are
currently allocating are frequent. The cacheline with the page struct then
needs to be repeatedly acquired in exclusive mode by both the allocating
thread and the freeing thread. If this happens frequently enough, performance
will suffer because of cacheline bouncing.

This patchset puts the lockless_freelist pointer in its own cacheline. In
order to make that happen we introduce a per cpu structure called
kmem_cache_cpu.
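
The structure needs only a few words; what matters is that each instance
gets a cacheline of its own. A minimal sketch, with field names following
the patch (the exact layout and any debug fields may differ):

	struct kmem_cache_cpu {
		void **freelist;	/* Pointer to the next free object */
		struct page *page;	/* Slab we are allocating from */
		int node;		/* Node of the cpu slab */
	} ____cacheline_aligned_in_smp;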

Instead of keeping an array of pointers to page structs we now keep an array
of per cpu structures that, among other things, contain the pointer to the
lockless freelist. The freeing thread can then keep exclusive access to the
page struct cacheline while the allocating thread keeps exclusive access to
the cacheline containing the per cpu structure.
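
A sketch of the resulting layout, with a small helper to fetch the
structure for a given cpu (the helper name get_cpu_slab follows the patch,
but take this as an illustration rather than the exact code):

	struct kmem_cache {
		/* ... */
		struct kmem_cache_cpu cpu_slab[NR_CPUS];
	};

	static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s,
							  int cpu)
	{
		return &s->cpu_slab[cpu];
	}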

This works as long as the allocating cpu is able to service its request
from the lockless freelist. If the lockless freelist runs empty then the
allocating thread needs to acquire exclusive access to the cacheline with
the page struct and lock the slab.

The allocating thread will then check whether new objects were freed to the
per cpu slab. If so, it will keep the slab as the cpu slab and continue with
the remotely freed objects. So the allocating thread can take a series of
objects that were just freed remotely and dish them out again. Ideally,
allocations would just keep recycling objects within the same slab this way,
which leads to an ideal allocation / remote free pattern.
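
In rough C, the refill step described above could look like this (a
simplified sketch: irq handling, debug checks and the rest of the real
__slab_alloc() are omitted, and field names approximate the kernel of
this era):

	void **object;

	slab_lock(c->page);			/* page struct cacheline */
	object = c->page->freelist;		/* filled by remote frees */
	if (object) {
		/*
		 * Keep the slab as the cpu slab: hand the chain of
		 * remotely freed objects over to the lockless freelist
		 * and allocate the first one.
		 */
		c->freelist = object[c->page->offset];
		c->page->freelist = NULL;
	}
	slab_unlock(c->page);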

The number of objects that can be handled in this way is limited by the
capacity of one slab. Increasing slab size via slub_min_objects/
slub_max_order may increase the number of objects and therefore performance.
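
For example, booting with larger values raises the number of objects per
slab (the numbers here are purely illustrative):

	slub_min_objects=32 slub_max_order=4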

If the allocating thread runs out of objects and finds that no objects were
put back by the remote processor then it will retrieve a new slab (from the
partial lists or from the page allocator) and start with a whole
new set of objects while the remote thread may still be freeing objects to
the old cpu slab. This may then repeat until the new slab is also exhausted.
If remote freeing has meanwhile freed objects in the earlier slab then that
earlier slab will now be on the partial list and the allocating thread will
pick it next for allocation. So the loop is extended. However, both threads
need to take the list_lock to make this swizzling via the partial list
happen.
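
The handoff through the partial list is serialized by list_lock on the
node structure. A sketch of the add side (the real add_partial() of this
era is essentially this, minus counters for debugging):

	static void add_partial(struct kmem_cache_node *n, struct page *page)
	{
		spin_lock(&n->list_lock);
		n->nr_partial++;
		list_add(&page->lru, &n->partial);
		spin_unlock(&n->list_lock);
	}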

It is likely that this kind of scheme will keep the set of objects being
passed around small enough to stay in the cpu caches, leading to increased
performance.

More code cleanups become possible:

- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
  Allows reducing the number of parameters to various functions.
- Can define a new node_match() function for NUMA to encapsulate locality
  checks (see the sketch below).
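
A sketch of such a node_match() helper, close to what the patch
introduces (a node of -1 means "no node preference"):

	static inline int node_match(struct kmem_cache_cpu *c, int node)
	{
	#ifdef CONFIG_NUMA
		if (node != -1 && c->node != node)
			return 0;
	#endif
		return 1;
	}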

Effect on allocations:

Cachelines touched before this patch:

	Write:	page struct cacheline and first cacheline of object

Cachelines touched after this patch:

	Write:	kmem_cache_cpu cacheline and first cacheline of object
	Read:	page struct cacheline (but see later patch that avoids
		touching that cacheline)

The handling when the lockless alloc list runs empty gets a bit more
complicated since another cacheline now has to be written to. But that is
halfway out of the hot path.

Effect on freeing:

Cachelines touched before this patch:

	Write:	page struct and first cacheline of object

Cachelines touched after this patch depending on how we free:

  Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
  Write(to other):	page struct and first cacheline of object

  Read(to cpu_slab):	page struct to id slab etc. (but see later patch that
  			avoids touching the page struct on free)
  Read(to other):	cpu local kmem_cache_cpu struct to verify it is not
  			the cpu slab.
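
The case distinction on the free path could look roughly like this (a
simplified sketch of the slab_free() fast path; irq disabling and debug
checks are omitted):

	static void slab_free(struct kmem_cache *s, struct page *page, void *x)
	{
		void **object = x;
		struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());

		if (likely(page == c->page)) {
			/* Free to the cpu slab: writes hit only the
			 * kmem_cache_cpu cacheline and the object. */
			object[page->offset] = c->freelist;
			c->freelist = object;
		} else
			/* Remote free: the slow path takes the page
			 * struct cacheline. */
			__slab_free(s, page, x);
	}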

Summary:

Pro:
	- Distinct cachelines so that concurrent remote frees and local
	  allocs on a cpu slab can occur without cacheline bouncing.
	- Avoids potential cacheline bouncing caused by neighboring
	  per cpu pointer updates in kmem_cache's cpu_slab structure,
	  since each entry now grows to a full cacheline. (Therefore the
	  comment that talks about that concern is removed.)

Cons:
	- Freeing objects now requires reading one additional cacheline.
	  That can be mitigated for some cases by the following patches,
	  but it is not possible to completely eliminate these references.

	- Memory usage grows slightly.

	The size of each per cpu object grows from one word (pointing to
	the page struct) to one cacheline holding various data. So this is
	NR_CPUS*NR_SLABS*L1_BYTES more memory use. Let's say NR_SLABS is
	100 and the cache line size is 128 bytes: then we have just
	increased SLUB metadata requirements by 12.8k per cpu.
	(Another later patch reduces these requirements.)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00