linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-17 04:49:33 +07:00

Nikos Tsironis 721b1d98fb dm snapshot: Fix excessive memory usage and workqueue stalls

kcopyd has no upper limit to the number of jobs one can allocate and issue. Under certain workloads this can lead to excessive memory usage and workqueue stalls. For example, when creating multiple dm-snapshot targets with a 4K chunk size and then writing to the origin through the page cache. Syncing the page cache causes a large number of BIOs to be issued to the dm-snapshot origin target, which itself issues an even larger (because of the BIO splitting taking place) number of kcopyd jobs.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

, with 8 active snapshots, results in the kcopyd job slab cache growing to 10G. Depending on the available system RAM this can lead to the OOM killer killing user processes:

[463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=1, oom_score_adj=0
[463.492894] kthreadd cpuset=/ mems_allowed=0
[463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
[463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[463.492952] Call Trace:
[463.492964] dump_stack+0x7d/0xbb
[463.492973] dump_header+0x6b/0x2fc
[463.492987] ? lockdep_hardirqs_on+0xee/0x190
[463.493012] oom_kill_process+0x302/0x370
[463.493021] out_of_memory+0x113/0x560
[463.493030] __alloc_pages_slowpath+0xf40/0x1020
[463.493055] __alloc_pages_nodemask+0x348/0x3c0
[463.493067] cache_grow_begin+0x81/0x8b0
[463.493072] ? cache_grow_begin+0x874/0x8b0
[463.493078] fallback_alloc+0x1e4/0x280
[463.493092] kmem_cache_alloc_node+0xd6/0x370
[463.493098] ? copy_process.part.31+0x1c5/0x20d0
[463.493105] copy_process.part.31+0x1c5/0x20d0
[463.493115] ? __lock_acquire+0x3cc/0x1550
[463.493121] ? __switch_to_asm+0x34/0x70
[463.493129] ? kthread_create_worker_on_cpu+0x70/0x70
[463.493135] ? finish_task_switch+0x90/0x280
[463.493165] _do_fork+0xe0/0x6d0
[463.493191] ? kthreadd+0x19f/0x220
[463.493233] kernel_thread+0x25/0x30
[463.493235] kthreadd+0x1bf/0x220
[463.493242] ? kthread_create_on_cpu+0x90/0x90
[463.493248] ret_from_fork+0x3a/0x50
[463.493279] Mem-Info:
[463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
[463.493285] active_file:80216 inactive_file:80107 isolated_file:435
[463.493285] unevictable:0 dirty:51266 writeback:109372 unstable:0
[463.493285] slab_reclaimable:31191 slab_unreclaimable:3483521
[463.493285] mapped:526 shmem:4903 pagetables:1759 bounce:0
[463.493285] free:33623 free_pcp:2392 free_cma:0
...
[463.493489] Unreclaimable slab info:
[463.493513] Name Used Total
[463.493522] bio-6 1028KB 1028KB
[463.493525] bio-5 1028KB 1028KB
[463.493528] dm_snap_pending_exception 236783KB 243789KB
[463.493531] dm_exception 41KB 42KB
[463.493534] bio-4 1216KB 1216KB
[463.493537] bio-3 439396KB 439396KB
[463.493539] kcopyd_job 6973427KB 6973427KB
...
[463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
[463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
[463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Moreover, issuing a large number of kcopyd jobs results in kcopyd hogging the CPU, while processing them. As a result, processing of work items, queued for execution on the same CPU as the currently running kcopyd thread, is stalled for long periods of time, hurting performance. Running the aforementioned test we get, in dmesg, messages like the following:

[67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
[67501.195586] Showing busy workqueues and worker pools:
[67501.195591] workqueue events: flags=0x0
[67501.195597] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195611] pending: cache_reap
[67501.195641] workqueue mm_percpu_wq: flags=0x8
[67501.195645] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195656] pending: vmstat_update
[67501.195682] workqueue kblockd: flags=0x18
[67501.195687] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[67501.195698] pending: blk_timeout_work
[67501.195753] workqueue kcopyd: flags=0x8
[67501.195757] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195768] pending: do_work [dm_mod]
[67501.195802] workqueue kcopyd: flags=0x8
[67501.195806] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195817] pending: do_work [dm_mod]
[67501.195834] workqueue kcopyd: flags=0x8
[67501.195838] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195848] pending: do_work [dm_mod]
[67501.195881] workqueue kcopyd: flags=0x8
[67501.195885] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195896] pending: do_work [dm_mod]
[67501.195920] workqueue kcopyd: flags=0x8
[67501.195924] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
[67501.195935] in-flight: 67:do_work [dm_mod]
[67501.195945] pending: do_work [dm_mod]
[67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

The root cause for these issues is the way dm-snapshot uses kcopyd. In particular, the lack of an explicit or implicit limit to the maximum number of in-flight COW jobs. The merging path is not affected because it implicitly limits the in-flight kcopyd jobs to one.

Fix these issues by using a semaphore to limit the maximum number of in-flight kcopyd jobs. We grab the semaphore before allocating a new kcopyd job in start_copy() and start_full_bio() and release it after the job finishes in copy_callback().

The initial semaphore value is configurable through a module parameter, to allow fine tuning the maximum number of in-flight COW jobs. Setting this parameter to zero initializes the semaphore to INT_MAX.

A default value of 2048 maximum in-flight kcopyd jobs was chosen. This value was decided experimentally as a trade-off between memory consumption, stalling the kernel's workqueues and maintaining a high enough throughput.

Re-running the aforementioned test:

* Workqueue stalls are eliminated
* kcopyd's job slab cache uses a maximum of 130MB
* The time taken by the test to write to the snapshot-origin target is reduced from 05m20.48s to 03m26.38s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-12-18 09:02:26 -05:00
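
The throttling scheme described in the commit message can be illustrated with a small user-space model. This is only a sketch in Python, not the kernel code: the class `CowThrottle`, the helper `initial_semaphore_value()`, the parameter name `cow_threshold`, and `run_demo()` are hypothetical names chosen for illustration. Only the function names `start_copy()` and `copy_callback()`, the default of 2048, and the "zero means INT_MAX" rule come from the commit message. A slot is taken before a job is created and given back when the job completes, so the number of in-flight jobs can never exceed the configured limit.

```python
import threading
import time

INT_MAX = 2**31 - 1           # stand-in for C's INT_MAX
DEFAULT_COW_THRESHOLD = 2048  # default quoted in the commit message


def initial_semaphore_value(cow_threshold):
    """Map the (hypothetical) tunable to a semaphore count; 0 means 'no practical limit'."""
    return cow_threshold if cow_threshold > 0 else INT_MAX


class CowThrottle:
    """User-space model of a semaphore bounding in-flight copy jobs."""

    def __init__(self, cow_threshold=DEFAULT_COW_THRESHOLD):
        self.limit = initial_semaphore_value(cow_threshold)
        self._sem = threading.Semaphore(self.limit)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak_in_flight = 0

    def start_copy(self):
        # Take a slot *before* creating the job; blocks once the limit is reached.
        self._sem.acquire()
        with self._lock:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)

    def copy_callback(self):
        # Job finished: update the bookkeeping, then return the slot.
        with self._lock:
            self.in_flight -= 1
        self._sem.release()


def run_demo(cow_threshold=4, workers=8, jobs_per_worker=10):
    """Issue jobs from several threads and return the throttle for inspection."""
    throttle = CowThrottle(cow_threshold)

    def worker():
        for _ in range(jobs_per_worker):
            throttle.start_copy()
            time.sleep(0.001)  # pretend to copy a chunk
            throttle.copy_callback()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return throttle


if __name__ == "__main__":
    result = run_demo()
    print("limit:", result.limit, "peak in-flight:", result.peak_in_flight)
```

In this model the peak number of in-flight jobs stays at or below the limit regardless of how many threads submit work, which mirrors the bounded slab growth reported above. The trade-off the commit message mentions still applies: a smaller limit caps memory more tightly but makes submitters wait more often.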
..
accessibility
acpi	libnvdimm fixes 4.20-rc6	2018-12-09 09:46:54 -08:00
amba
android	binder: fix race that allows malicious free of live buffer	2018-11-26 20:01:47 +01:00
ata	Linux 4.20-rc6	2018-12-09 17:45:40 -07:00
atm	firestream: fix spelling mistake: "Inititing" -> "Initializing"	2018-11-27 15:32:06 -08:00
auxdisplay	The Compiler Attributes series	2018-11-01 18:34:46 -07:00
base	devres: Align data[] to ARCH_KMALLOC_MINALIGN	2018-11-11 11:40:04 -08:00
bcma
block	block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add()	2018-12-16 09:01:38 -07:00
bluetooth	Merge branch 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-10-24 14:43:41 +01:00
bus	ARM: SoC driver updates for 4.17	2018-10-29 15:16:01 -07:00
cdrom	gdrom: fix mistake in assignment of error	2018-10-25 11:17:40 -06:00
char	RTC for 4.20	2018-10-27 09:24:24 -07:00
clk	clk: zynqmp: Off by one in zynqmp_is_valid_clock()	2018-12-03 09:54:48 -08:00
clocksource	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2018-11-11 16:41:50 -06:00
connector
cpufreq	cpufreq: ti-cpufreq: Only register platform_device when supported	2018-11-19 11:26:06 +01:00
cpuidle	ARM: cpuidle: Convert to use cpuidle_register|unregister()	2018-11-08 18:53:00 +01:00
crypto	crypto: hisilicon - Fix reference after free of memories on error path	2018-11-09 17:35:43 +08:00
dax
dca
devfreq
dio
dma	dmaengine: dw: Fix FIFO size for Intel Merrifield	2018-12-06 22:53:05 +05:30
dma-buf	udmabuf: set read/write flag when exporting	2018-11-16 08:50:53 +01:00
edac	* skx_edac: Address translation for NVDIMMs (Tony Luck and Qiuxu Zhuo)	2018-11-02 11:17:22 -07:00
eisa
extcon
firewire
firmware	efi: Prevent GICv3 WARN() by mapping the memreserve table before first use	2018-11-27 13:50:20 +01:00
fmc
fpga	fpga: add devm_fpga_region_create	2018-10-16 11:13:50 +02:00
fsi	fsi: fsi-scom.c: Remove duplicate header	2018-11-26 10:13:04 +11:00
gnss	gnss: sirf: fix activation retry handling	2018-12-06 17:22:23 +01:00
gpio	ARM: SoC fixes	2018-12-02 12:19:44 -08:00
gpu	drm/ast: Fix connector leak during driver unload	2018-12-06 14:12:02 +10:00
hid	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input	2018-12-04 08:47:04 -08:00
hsi
hv	Drivers: hv: vmbus: Offload the handling of channels to two workqueues	2018-12-03 08:01:01 +01:00
hwmon	hwmon: (w83795) temp4_type has writable permission	2018-11-18 14:34:56 -08:00
hwspinlock
hwtracing	stm class: Use memcat_p()	2018-10-11 12:12:55 +02:00
i2c	i2c: uniphier-f: fix violation of tLOW requirement for Fast-mode	2018-12-06 23:14:59 +01:00
ide	Linux 4.20-rc6	2018-12-09 17:45:40 -07:00
idle	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2018-10-23 13:32:18 +01:00
iio	iio/hid-sensors: Fix IIO_CHAN_INFO_RAW returning wrong values for signed numbers	2018-11-16 11:42:12 +00:00
infiniband	RDMA/mlx5: Initialize return variable in case pagefault was skipped	2018-11-29 15:16:45 -07:00
input	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input	2018-12-04 08:47:04 -08:00
iommu	iommu/vt-d: Use memunmap to free memremap	2018-11-22 17:02:21 +01:00
ipack
irqchip	irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function	2018-11-01 12:38:48 +01:00
isdn	Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2018-11-01 19:58:52 -07:00
leds	LED fixes for 4.20-rc2	2018-11-08 17:49:04 -06:00
lightnvm	lightnvm: pblk: do not overwrite ppa list with meta list	2018-12-11 12:22:35 -07:00
macintosh	memblock: stop using implicit alignment to SMP_CACHE_BYTES	2018-10-31 08:54:16 -07:00
mailbox	- Convert print users to use the %pOFn format specifier	2018-10-29 10:30:44 -07:00
mcb
md	dm snapshot: Fix excessive memory usage and workqueue stalls	2018-12-18 09:02:26 -05:00
media	media: dvb-pll: don't re-validate tuner frequencies	2018-11-27 13:51:32 -05:00
memory
memstick	ms_block: remove unused pointer 'set'	2018-11-08 06:17:26 -07:00
message
mfd	Revert "mfd: cros_ec: Use devm_kzalloc for private data"	2018-12-05 09:59:38 +00:00
misc	misc: mic/scif: fix copy-paste error in scif_create_remote_lookup	2018-11-27 09:00:38 +01:00
mmc	Linux 4.20-rc5	2018-12-04 09:38:05 -07:00
mtd	mtd: nand: Fix memory allocation in nanddev_bbt_init()	2018-11-28 15:41:50 +01:00
mux	This is the bulk of GPIO changes for the v4.20 series:	2018-10-23 08:45:05 +01:00
net	ath6kl: add ath6kl_ prefix to crypto_type	2018-12-13 09:58:52 +01:00
nfc	NFC: nfcmrvl_uart: fix OF child-node lookup	2018-10-23 13:28:53 -05:00
ntb	ntb: idt: Alter the driver info comments	2018-11-01 10:33:12 -04:00
nubus
nvdimm	Linux 4.20-rc6	2018-12-09 17:45:40 -07:00
nvme	nvme-pci: don't share queue maps	2018-12-17 05:44:45 -07:00
nvmem	nvmem: core: fix regression in of_nvmem_cell_get()	2018-11-11 09:15:29 -08:00
of	Devicetree fixes for 4.20-rc:	2018-11-09 16:41:58 -06:00
opp	OPP: Fix parsing of multiple phandles in "operating-points-v2" property	2018-11-23 10:47:21 +05:30
oprofile
parisc	parisc: Add alternative coding infrastructure	2018-10-17 17:22:26 +02:00
parport
pci	Linux 4.20-rc6	2018-12-09 17:45:40 -07:00
pcmcia	powerpc updates for 4.20	2018-10-26 14:36:21 -07:00
perf	arm64 updates for 4.20:	2018-10-22 17:30:06 +01:00
phy	phy: qcom-qusb2: Fix HSTX_TRIM tuning with fused value for SDM845	2018-11-21 13:13:58 +05:30
pinctrl	pinctrl: meson: fix meson8b ao pull register bits	2018-11-05 09:33:22 +01:00
platform	platform-drivers-x86 for v4.20-1	2018-11-01 08:42:21 -07:00
pnp
power	Devicetree updates for 4.20:	2018-10-26 12:09:58 -07:00
powercap	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2018-10-23 13:32:18 +01:00
pps
ps3
ptp	ptp: drop redundant kasprintf() to create worker name	2018-10-28 19:20:06 -07:00
pwm	pwm: lpss: Only set update bit if we are actually changing the settings	2018-10-16 13:16:15 +02:00
rapidio
ras
regulator	regulator: Regulator updates for next release	2018-10-23 01:54:44 +01:00
remoteproc	remoteproc: qcom: q6v5-mss: Register segments/dumpfn for coredump	2018-10-19 12:54:03 -07:00
reset	ARM: SoC driver updates for 4.17	2018-10-29 15:16:01 -07:00
rpmsg	rpmsg: glink: smem: Support rx peak for size less than 4 bytes	2018-10-03 17:04:32 -07:00
rtc	Staging and IIO driver fixes for 4.20-rc5	2018-11-30 12:23:44 -08:00
s390	Linux 4.20-rc6	2018-12-09 17:45:40 -07:00
sbus	drivers/sbus/char: add of_node_put()	2018-12-02 20:55:23 -08:00
scsi	Linux 4.20-rc6	2018-12-09 17:45:40 -07:00
sfi	mm: remove include/linux/bootmem.h	2018-10-31 08:54:16 -07:00
sh
siox
slimbus	slimbus: ngd: remove unnecessary check	2018-11-07 14:59:28 +01:00
sn
soc	soc: ti: QMSS: Fix usage of irq_set_affinity_hint	2018-11-02 11:22:09 -07:00
soundwire
spi	spi: Fixes for v4.20	2018-11-28 08:33:55 -08:00
spmi
ssb	ssb: chipcommon: fix fall-through annotation	2018-10-05 11:37:20 +03:00
staging	Staging fixes for 4.20-rc6	2018-12-09 10:35:33 -08:00
target	sbitmap: optimize wakeup check	2018-11-30 14:48:04 -07:00
tc	TC: Set DMA masks for devices	2018-10-11 09:16:44 -07:00
tee
thermal	thermal: broadcom: constify thermal_zone_of_device_ops structure	2018-12-05 06:47:46 -08:00
thunderbolt	thunderbolt: Prevent root port runtime suspend during NVM upgrade	2018-11-26 20:38:49 +01:00
tty	TTY driver fixes for 4.20-rc6	2018-12-09 10:24:29 -08:00
uio	uio: Fix an Oops on load	2018-11-11 09:21:46 -08:00
usb	USB-serial fix for v4.20-rc6	2018-12-06 18:02:58 +01:00
uwb
vfio	VFIO updates for v4.20	2018-10-31 11:01:38 -07:00
vhost	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2018-12-09 15:12:33 -08:00
video	fbdev changes for v4.20:	2018-10-31 11:41:37 -07:00
virt
virtio	virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON	2018-10-24 20:57:55 -04:00
visorbus
vlynq
vme
w1	w1: IAD Register is yet readable trough iad sys file. Fix snprintf (%u for unsigned, count for max size).	2018-10-15 20:50:32 +02:00
watchdog	watchdog: ts4800: release syscon device node in ts4800_wdt_probe()	2018-10-22 10:16:28 +02:00
xen	xen: fixes for 4.20-rc5	2018-12-02 12:15:55 -08:00
zorro
Kconfig
Makefile