linux_dsm_epyc7002/Documentation
Andrea Arcangeli 2c653d0ee2 ksm: introduce ksm_max_page_sharing per page deduplication limit
Without a max deduplication limit for each KSM page, the list of the
rmap_items associated to each stable_node can grow infinitely large.

During the rmap walk each entry can take up to ~10usec to process
because of IPIs for the TLB flushing (both for the primary MMU and the
secondary MMUs with the MMU notifier).  With only 16GB of address space
shared in the same KSM page, that would amount to dozens of seconds of
kernel runtime.

A ~256 max deduplication factor will reduce the latencies of the rmap
walks on KSM pages to order of a few msec.  Just doing the
cond_resched() during the rmap walks is not enough, the list size must
have a limit too, otherwise the caller could get blocked in (schedule
friendly) kernel computations for seconds, unexpectedly.

There's room for optimization to significantly reduce the IPI delivery
cost during the page_referenced(), but at least for page_migration in
the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
it may be inevitable to send lots of IPIs if each rmap_item->mm is
active on a different CPU and there are lots of CPUs.  Even if we ignore
the IPI delivery cost, we've still to walk the whole KSM rmap list, so
we can't allow millions or billions (ulimited) number of entries in the
KSM stable_node rmap_item lists.

The limit is enforced efficiently by adding a second dimension to the
stable rbtree.  So there are three types of stable_nodes: the regular
ones (identical as before, living in the first flat dimension of the
stable rbtree), the "chains" and the "dups".

Every "chain" and all "dups" linked into a "chain" enforce the invariant
that they represent the same write protected memory content, even if
each "dup" will be pointed by a different KSM page copy of that content.
This way the stable rbtree lookup computational complexity is unaffected
if compared to an unlimited max_sharing_limit.  It is still enforced
that there cannot be KSM page content duplicates in the stable rbtree
itself.

Adding the second dimension to the stable rbtree only after the
max_page_sharing limit hits, provides for a zero memory footprint
increase on 64bit archs.  The memory overhead of the per-KSM page
stable_tree and per virtual mapping rmap_item is unchanged.  Only after
the max_page_sharing limit hits, we need to allocate a stable_tree
"chain" and rb_replace() the "regular" stable_node with the newly
allocated stable_node "chain".  After that we simply add the "regular"
stable_node to the chain as a stable_node "dup" by linking hlist_dup in
the stable_node_chain->hlist.  This way the "regular" (flat) stable_node
is converted to a stable_node "dup" living in the second dimension of
the stable rbtree.

During stable rbtree lookups the stable_node "chain" is identified as
stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
is_stable_node_chain()).

When dropping stable_nodes, the stable_node "dup" is identified as
stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
elsewhere in any stable_node->head/node to avoid a clashes with the
stable_node->node.rb_parent_color pointer, and different from
&migrate_nodes.  So the second field of &migrate_nodes is picked and
verified as always safe with a BUILD_BUG_ON in case the list_head
implementation changes in the future.

The STABLE_NODE_DUP is picked as a random negative value in
stable_node->rmap_hlist_len.  rmap_hlist_len cannot become negative when
it's a "regular" stable_node or a stable_node "dup".

The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn
is aliased in a union with a time field used to rate limit the
stable_node_chain->hlist prunes.

The garbage collection of the stable_node_chain happens lazily during
stable rbtree lookups (as for all other kind of stable_nodes), or while
disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
entire stable rbtree.

While the "regular" stable_nodes and the stable_node "dups" must wait
for their underlying tree_page to be freed before they can be freed
themselves, the stable_node "chains" can be freed immediately if the
stable_node->hlist turns empty.  This is because the "chains" are never
pointed by any page->mapping and they're effectively stable rbtree KSM
self contained metadata.

[akpm@linux-foundation.org: fix non-NUMA build]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Evgheni Dereveanchin <ederevea@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
..
ABI Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
accounting
acpi Merge branches 'acpi-button' and 'acpi-tools' 2017-05-22 20:29:06 +02:00
admin-guide mm: allow slab_nomerge to be set at build time 2017-07-06 16:24:31 -07:00
aoe
arm ARM: at91: Documentation: add armv7m families 2017-06-02 10:11:09 +02:00
arm64 arm64: documentation: document tagged pointer stack constraints 2017-05-09 17:43:18 +01:00
auxdisplay
backlight
blackfin
block block: remove bio_clone() and all references. 2017-06-18 12:40:59 -06:00
blockdev remove the mg_disk driver 2017-04-14 14:00:49 -06:00
bus-devices
cdrom cdrom: Make device operations read-only 2017-02-14 08:29:56 -07:00
cgroup-v1 Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2017-02-27 21:41:08 -08:00
cma
connector
console
core-api There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
cpu-freq cpufreq: intel_pstate: Document the current behavior and user interface 2017-05-14 02:06:03 +02:00
cpuidle
cris
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2017-07-05 12:22:23 -07:00
dev-tools There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
device-mapper - A major update for DM cache that reduces the latency for deciding 2017-05-03 10:31:20 -07:00
devicetree Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata 2017-07-06 09:41:58 -07:00
dmaengine dmaengine: Documentation: Fix typo in pxa_dma.txt 2016-11-14 08:14:24 +05:30
doc-guide kernel-doc: describe the `literal` syntax 2017-05-16 08:44:24 -03:00
driver-api There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
driver-model Char/Misc patches for 4.13-rc1 2017-07-03 20:55:59 -07:00
early-userspace Documentation: Fix dead URLs to ftp.kernel.org 2017-03-29 15:46:06 -06:00
EDID drm: use .hword to represent 16-bit numbers 2017-03-30 10:15:19 +02:00
extcon extcon: Remove porting compatibility of swich class 2017-04-06 10:55:24 +09:00
fault-injection
fb docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
features powerpc updates for 4.12 part 1. 2017-05-05 11:36:44 -07:00
filesystems There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
firmware_class firmware: revamp firmware documentation 2017-01-11 09:42:59 +01:00
fmc
fpga fpga: Add scatterlist based programming 2017-02-10 15:20:44 +01:00
frv
gpio gpio: return NULL from gpiod_get_optional when GPIOLIB is disabled 2017-03-15 11:16:30 +01:00
gpu docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
hid Documentation: hid: fix path to input bus definitions 2017-03-13 17:15:19 -06:00
hwmon hwmon: (pmbus) move header file out of I2C realm 2017-06-11 17:08:19 -07:00
i2c i2c: i801: Add support for Intel Gemini Lake 2017-02-09 17:39:16 +01:00
ia64
ide
iio
infiniband IB/opa-vnic: Virtual Network Interface Controller (VNIC) documentation 2017-04-20 12:01:06 -04:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2017-05-26 16:45:13 -07:00
ioctl TEE driver infrastructure and OP-TEE drivers 2017-05-10 11:20:09 -07:00
isdn
kbuild Documentation, kbuild: fix typo "minimun" -> "minimum" 2017-05-18 10:49:44 -06:00
kdump Documentation: kdump: describe arm64 port 2017-04-05 18:32:32 +01:00
kernel-hacking There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
laptops platform/x86: thinkpad_acpi: Add support for X1 Yoga (2016) Tablet Mode 2016-12-13 09:29:06 -08:00
leds Documentaion: leds: leds-lp55xx.txt: Fix typos 2017-03-17 13:06:14 -06:00
lightnvm lightnvm: physical block device (pblk) target 2017-04-16 10:06:33 -06:00
livepatch livepatch: allow removal of a disabled patch 2017-03-08 09:38:43 +01:00
locking locking/ww_mutex/Documentation: Update the design document 2017-01-14 11:14:55 +01:00
m68k
md Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-05-03 10:05:38 -07:00
media Docs: Use kernel-figure in vidioc-g-selection.rst 2017-06-23 13:45:56 -06:00
memory-devices
metag
mic
mips
misc-devices Documentation: misc-devices: Add Documentation for pci-endpoint-test driver 2017-04-28 10:23:19 -05:00
mmc MMC core: 2017-05-02 17:34:32 -07:00
mn10300
mtd spi-nor: Add support for Intel SPI serial flash controller 2017-01-03 17:33:36 +00:00
namespaces
netlabel
networking Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
nfc
nios2
nvdimm
nvmem
parisc
PCI docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
pcmcia
perf perf: qcom: Add L3 cache PMU driver 2017-04-03 18:53:50 +01:00
phy
platform
power power supply and reset changes for the v4.13 series 2017-07-04 14:25:14 -07:00
powerpc powerpc/fadump: update documentation about crashkernel parameter reuse 2017-05-08 17:15:11 -07:00
pps Doc: clarify source of jitter in USB1.1, and USB2.0 2017-01-04 14:40:52 -07:00
process doc: Document suitability of IBM Verse for kernel development 2017-06-22 10:22:41 -06:00
pti
ptp
rapidio
RCU rcu: Remove debugfs tracing 2017-06-08 18:52:43 -07:00
s390 docs: add documentation for vfio-ccw 2017-03-31 12:55:11 +02:00
scheduler sched/deadline: Add documentation about GRUB reclaiming 2017-06-08 10:31:56 +02:00
scsi scsi: make asynchronous aborts mandatory 2017-04-06 13:07:33 -04:00
security docs: Fix some formatting issues in request-key.rst 2017-05-18 10:46:25 -06:00
serial tty: n_gsm: do not send/receive in ldisc close path 2017-06-03 18:48:52 +09:00
sh docs-rst: convert sh book to ReST 2017-05-16 08:44:18 -03:00
sound There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
sparc Documentation/sparc: Steps for sending break on sunhv console 2017-02-23 08:27:25 -08:00
sphinx Docs: clean up some DocBook loose ends 2017-06-23 14:17:38 -06:00
sphinx-static
spi spi: Document SPI slave controller support 2017-05-26 13:11:00 +01:00
sysctl Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning 2017-04-21 13:22:34 -04:00
target Documentation/target: add an example script to configure an iSCSI target 2017-05-01 22:21:35 -07:00
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2017-05-12 11:58:45 -07:00
timers rcu: Eliminate NOCBs CPU-state Kconfig options 2017-06-08 18:52:43 -07:00
trace Char/Misc patches for 4.13-rc1 2017-07-03 20:55:59 -07:00
translations doc/kokr/howto: Only send regression fixes after -rc1 2017-06-22 10:25:22 -06:00
usb usb: gadget: add f_uac1 variant based on a new u_audio api 2017-06-19 09:22:47 +03:00
userspace-api doc: ReSTify no_new_privs.txt 2017-05-18 10:30:09 -06:00
virtual Second round of KVM/ARM Changes for v4.12. 2017-05-09 12:51:49 +02:00
vm ksm: introduce ksm_max_page_sharing per page deduplication limit 2017-07-06 16:24:31 -07:00
w1 w1: add documentation for w1_ds2438 2017-03-17 15:10:49 +09:00
watchdog iTCO_wdt: all versions count down twice 2017-05-19 10:42:11 +02:00
wimax
x86 x86/mce: Update bootlog description to reflect behavior on AMD 2017-06-14 07:32:10 +02:00
xtensa
.gitignore
00-INDEX Merge remote-tracking branch 'mauro-exp/docbook3' into death-to-docbook 2017-05-18 11:03:08 -06:00
bcache.txt
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt
cgroup-v2.txt cgroup: implement "nsdelegate" mount option 2017-06-28 14:45:21 -04:00
Changes docs: add back 'Documentation/Changes' file (as symlink) 2016-12-14 16:30:12 -08:00
circular-buffers.txt Documentation: circular-buffers: use READ_ONCE() 2016-11-16 16:17:45 -07:00
clk.txt
CodingStyle
conf.py Docs: Fix breakage with Sphinx 1.5 and upper 2017-06-23 13:45:37 -06:00
cpu-load.txt
cputopology.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00
dell_rbu.txt
digsig.txt
DMA-API-HOWTO.txt
DMA-API.txt Documentation: DMA API: fix a typo in a function name 2017-06-05 15:57:02 -06:00
DMA-attributes.txt common: DMA-mapping: add DMA_ATTR_PRIVILEGED attribute 2017-01-19 15:56:19 +00:00
DMA-ISA-LPC.txt Documentation: DMA-ISA-LPC.txt 2017-02-12 15:20:07 -07:00
docutils.conf
dontdiff GCC plugin updates: 2017-07-05 11:46:59 -07:00
efi-stub.txt
eisa.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcc-plugins.txt gcc-plugins: update architecture list in documentation 2017-03-21 22:20:05 +11:00
highuid.txt
hw_random.txt
hwspinlock.txt
index.rst Make the main documentation title less Geocities 2017-06-23 14:02:27 -06:00
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt Documentation: Fix a typo in IPMI.txt. 2017-01-05 15:01:54 -06:00
IRQ-affinity.txt
IRQ-domain.txt Documentation: Update IRQ-domain.txt to document irq_domain_mapping 2017-05-22 22:29:45 +02:00
IRQ.txt
irqflags-tracing.txt
isa.txt
isapnp.txt
kernel-doc-nano-HOWTO.txt docs: update old references for DocBook from the documentation 2017-05-16 08:44:19 -03:00
kernel-per-CPU-kthreads.txt rcu: Eliminate NOCBs CPU-state Kconfig options 2017-06-08 18:52:43 -07:00
kobject.txt
kprobes.txt
kref.txt Revert "kref: double kref_put() in my_data_handler()" 2017-04-08 18:38:10 +02:00
kselftest.txt scripts/spelling.txt: add "an user" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
ldm.txt
lockup-watchdogs.txt
logo.gif
logo.txt
lsm.txt docs-rst: convert lsm from DocBook to ReST 2017-05-16 08:44:19 -03:00
lzo.txt
mailbox.txt
Makefile docs: remove DocBook from the building system 2017-05-16 08:44:19 -03:00
memory-barriers.txt There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
memory-hotplug.txt scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
men-chameleon-bus.txt
nommu-mmap.txt
ntb.txt
numastat.txt
padata.txt
parport-lowlevel.txt
percpu-rw-semaphore.txt
phy.txt Documentation: phy: Fix repetition of word 'the' 2017-03-09 00:33:15 -07:00
pi-futex.txt
pinctrl.txt pinctrl: core: Fix pinctrl_register_and_init() with pinctrl_enable() 2017-04-07 01:08:08 +02:00
pnp.txt
preempt-locking.txt
printk-formats.txt
pwm.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rtc.txt
SAK.txt
sgi-ioc4.txt
siphash.txt siphash: implement HalfSipHash1-3 for hash tables 2017-01-09 13:58:57 -05:00
SM501.txt
smsc_ece1099.txt
static-keys.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00
SubmittingPatches
svga.txt
switchtec.txt switchtec: Add IOCTLs to the Switchtec driver 2017-04-12 12:23:37 -05:00
sync_file.txt Documentation: sync_file.txt: Fix typos 2017-03-17 13:03:36 -06:00
tee.txt Documentation: tee subsystem and op-tee driver 2017-03-10 14:51:57 +01:00
this_cpu_ops.txt
unaligned-memory-access.txt Documentation/unaligned-memory-access.txt: fix incorrect comparison operator 2016-12-27 13:08:42 -07:00
vfio-mediated-device.txt docs: Fix a spelling error in vfio-mediated-device.txt 2017-04-27 15:54:39 -06:00
vfio.txt
video-output.txt
xillybus.txt
xz.txt
zorro.txt docs: Fix a couple typos 2017-04-27 15:54:39 -06:00