linux_dsm_epyc7002/kernel
Mike Christie 8d19f1c8e1
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
and nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace daemons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.

In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want to execute IO to the device.

Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory. Here an nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.

[ 1626.609191] msgr-worker-1   D    0  1026      1 0x00004000
[ 1626.609193] Call Trace:
[ 1626.609195]  ? __schedule+0x29b/0x630
[ 1626.609197]  ? wait_for_completion+0xe0/0x170
[ 1626.609198]  schedule+0x30/0xb0
[ 1626.609200]  schedule_timeout+0x1f6/0x2f0
[ 1626.609202]  ? blk_finish_plug+0x21/0x2e
[ 1626.609204]  ? _xfs_buf_ioapply+0x2e6/0x410
[ 1626.609206]  ? wait_for_completion+0xe0/0x170
[ 1626.609208]  wait_for_completion+0x108/0x170
[ 1626.609210]  ? wake_up_q+0x70/0x70
[ 1626.609212]  ? __xfs_buf_submit+0x12e/0x250
[ 1626.609214]  ? xfs_bwrite+0x25/0x60
[ 1626.609215]  xfs_buf_iowait+0x22/0xf0
[ 1626.609218]  __xfs_buf_submit+0x12e/0x250
[ 1626.609220]  xfs_bwrite+0x25/0x60
[ 1626.609222]  xfs_reclaim_inode+0x2e8/0x310
[ 1626.609224]  xfs_reclaim_inodes_ag+0x1b6/0x300
[ 1626.609227]  xfs_reclaim_inodes_nr+0x31/0x40
[ 1626.609228]  super_cache_scan+0x152/0x1a0
[ 1626.609231]  do_shrink_slab+0x12c/0x2d0
[ 1626.609233]  shrink_slab+0x9c/0x2a0
[ 1626.609235]  shrink_node+0xd7/0x470
[ 1626.609237]  do_try_to_free_pages+0xbf/0x380
[ 1626.609240]  try_to_free_pages+0xd9/0x1f0
[ 1626.609245]  __alloc_pages_slowpath+0x3a4/0xd30
[ 1626.609251]  ? ___slab_alloc+0x238/0x560
[ 1626.609254]  __alloc_pages_nodemask+0x30c/0x350
[ 1626.609259]  skb_page_frag_refill+0x97/0xd0
[ 1626.609274]  sk_page_frag_refill+0x1d/0x80
[ 1626.609279]  tcp_sendmsg_locked+0x2bb/0xdd0
[ 1626.609304]  tcp_sendmsg+0x27/0x40
[ 1626.609307]  sock_sendmsg+0x54/0x60
[ 1626.609308]  ___sys_sendmsg+0x29f/0x320
[ 1626.609313]  ? sock_poll+0x66/0xb0
[ 1626.609318]  ? ep_item_poll.isra.15+0x40/0xc0
[ 1626.609320]  ? ep_send_events_proc+0xe6/0x230
[ 1626.609322]  ? hrtimer_try_to_cancel+0x54/0xf0
[ 1626.609324]  ? ep_read_events_proc+0xc0/0xc0
[ 1626.609326]  ? _raw_write_unlock_irq+0xa/0x20
[ 1626.609327]  ? ep_scan_ready_list.constprop.19+0x218/0x230
[ 1626.609329]  ? __hrtimer_init+0xb0/0xb0
[ 1626.609331]  ? _raw_spin_unlock_irq+0xa/0x20
[ 1626.609334]  ? ep_poll+0x26c/0x4a0
[ 1626.609337]  ? tcp_tsq_write.part.54+0xa0/0xa0
[ 1626.609339]  ? release_sock+0x43/0x90
[ 1626.609341]  ? _raw_spin_unlock_bh+0xa/0x20
[ 1626.609342]  __sys_sendmsg+0x47/0x80
[ 1626.609347]  do_syscall_64+0x5f/0x1c0
[ 1626.609349]  ? prepare_exit_to_usermode+0x75/0xa0
[ 1626.609351]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to avoid being throttled while
writing out data to free up memory.
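
As a rough illustration of how a daemon might use the new commands, here is
a minimal userspace sketch. The helper names io_flusher_set() and
io_flusher_get() are hypothetical. The fallback values 57 and 58 and the
CAP_SYS_RESOURCE requirement are assumptions taken from the mainline uapi
header and sys.c; check them against the headers of the tree you build
against.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers. These values are assumed to
 * match mainline include/uapi/linux/prctl.h; verify against your tree. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/* Hypothetical helper: mark (enable != 0) or unmark the calling thread
 * as an IO flusher. Returns 0 on success or -errno on failure, e.g.
 * -EPERM without CAP_SYS_RESOURCE (assumed requirement) or -EINVAL on a
 * kernel that does not know the command. */
static int io_flusher_set(int enable)
{
    if (prctl(PR_SET_IO_FLUSHER, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) < 0)
        return -errno;
    return 0;
}

/* Hypothetical helper: query the current state. Returns 1 if the thread
 * is marked as an IO flusher, 0 if it is not, or -errno on failure. */
static int io_flusher_get(void)
{
    int ret = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);

    return ret < 0 ? -errno : ret;
}
```

A daemon would call io_flusher_set(1) once its initial setup is done and
before it starts servicing requests in the IO path. Treating a failure as
non-fatal keeps the daemon usable on kernels without this patch, at the
cost of leaving the reclaim recursion described above possible.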

Signed-off-by: Mike Christie <mchristi@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-28 10:09:51 +01:00
bpf bpf: Fix precision tracking for unbounded scalars 2019-12-22 17:21:10 -08:00
cgroup Merge branch 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2019-11-25 19:23:46 -08:00
configs
debug kdb: Tweak escape handling for vi users 2019-10-28 12:08:29 +00:00
dma lib/genalloc.c: rename addr_in_gen_pool to gen_pool_has_addr 2019-12-04 19:44:13 -08:00
events perf/core: Add SRCU annotation for pmus list walk 2019-12-17 13:32:46 +01:00
gcov um: Enable CONFIG_CONSTRUCTORS 2019-09-15 21:37:13 +02:00
irq irqchip updates for Linux 5.5 2019-11-20 14:16:34 +01:00
livepatch New tracing features: 2019-11-27 11:42:01 -08:00
locking Revert "locking/mutex: Complain upon mutex API misuse in IRQ contexts" 2019-12-11 00:27:43 +01:00
power Additional power management updates for 5.5-rc1 2019-12-04 10:48:09 -08:00
printk Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-12-03 09:29:50 -08:00
rcu Merge branches 'doc.2019.10.29a', 'fixes.2019.10.30a', 'nohz.2019.10.28a', 'replace.2019.10.30a', 'torture.2019.10.05a' and 'lkmm.2019.10.05a' into HEAD 2019-10-30 08:47:13 -07:00
sched Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-12-21 10:52:10 -08:00
time ptp: fix the race between the release of ptp_clock and cdev 2019-12-30 20:19:27 -08:00
trace tracing: Fix endianness bug in histogram trigger 2019-12-21 16:08:59 -05:00
.gitignore
acct.c
async.c
audit_fsnotify.c
audit_tree.c
audit_watch.c audit_get_nd(): don't unlock parent too early 2019-11-10 11:56:55 -05:00
audit.c audit: remove redundant condition check in kauditd_thread() 2019-10-25 11:48:14 -04:00
audit.h
auditfilter.c
auditsc.c Revert "bpf: Emit audit messages upon successful prog load and unload" 2019-11-23 09:56:02 -08:00
backtracetest.c
bounds.c
capability.c
compat.c y2038: itimer: compat handling to itimer.c 2019-11-15 14:38:30 +01:00
configs.c
context_tracking.c context_tracking: Rename context_tracking_is_enabled() => context_tracking_enabled() 2019-10-29 10:01:12 +01:00
cpu_pm.c
cpu.c Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-11-26 16:02:40 -08:00
crash_core.c
crash_dump.c
cred.c memcg: account security cred as well to kmemcg 2020-01-04 13:55:09 -08:00
delayacct.c
dma.c
elfcore.c kernel/elfcore.c: include proper prototypes 2019-09-25 17:51:39 -07:00
exec_domain.c
exit.c for-linus-2020-01-03 2020-01-03 11:17:14 -08:00
extable.c bpf: Add support for BTF pointers to x86 JIT 2019-10-17 16:44:36 +02:00
fail_function.c
fork.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-12-03 12:20:25 -08:00
freezer.c Revert "libata, freezer: avoid block device removal while system is frozen" 2019-10-06 09:11:37 -06:00
futex.c futex: Prevent exit livelock 2019-11-20 09:40:38 +01:00
gen_kheaders.sh kheaders: explain why include/config/autoconf.h is excluded from md5sum 2019-11-11 20:10:01 +09:00
groups.c
hung_task.c
iomem.c
irq_work.c irq_work: Fix IRQ_WORK_BUSY bit clearing 2019-11-15 10:48:37 +01:00
jump_label.c
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt sched/Kconfig: Fix spelling mistake in user-visible help text 2019-11-12 11:35:32 +01:00
kcov.c kcov: remote coverage support 2019-12-04 19:44:14 -08:00
kexec_core.c kexec: bail out upon SIGKILL when allocating memory. 2019-09-25 17:51:40 -07:00
kexec_elf.c kexec_elf: support 32 bit ELF files 2019-09-06 23:58:44 +02:00
kexec_file.c kexec: Fix pointer-to-int-cast warnings 2019-11-01 21:42:58 +01:00
kexec_internal.h
kexec.c
kheaders.c
kmod.c
kprobes.c Tracing updates: 2019-09-20 11:19:48 -07:00
ksysfs.c
kthread.c kthread: make __kthread_queue_delayed_work static 2019-10-16 09:20:58 -07:00
latencytop.c
Makefile Kbuild updates for v5.5 2019-12-02 17:35:04 -08:00
module_signature.c
module_signing.c
module-internal.h
module.c This contains 3 changes: 2019-12-11 12:22:38 -08:00
notifier.c kernel/notifier.c: remove blocking_notifier_chain_cond_register() 2019-12-04 19:44:12 -08:00
nsproxy.c
padata.c padata: remove cpu_index from the parallel_queue 2019-09-13 21:15:41 +10:00
panic.c locking/refcount: Remove unused 'refcount_error_report()' function 2019-11-25 09:15:42 +01:00
params.c
pid_namespace.c fork: extend clone3() to support setting a PID 2019-11-15 23:49:22 +01:00
pid.c pid: Implement pidfd_getfd syscall 2020-01-13 21:49:36 +01:00
profile.c kernel/profile.c: use cpumask_available to check for NULL cpumask 2019-12-04 19:44:12 -08:00
ptrace.c
range.c
reboot.c
relay.c
resource.c mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range() 2019-09-24 15:54:09 -07:00
rseq.c
seccomp.c seccomp: Check that seccomp_notif is zeroed out by the user 2020-01-02 13:03:45 -08:00
signal.c sched.h: Annotate sighand_struct with __rcu 2020-01-26 10:54:47 +01:00
smp.c
smpboot.c
smpboot.h
softirq.c
stackleak.c
stacktrace.c stacktrace: Get rid of unneeded '!!' pattern 2019-11-11 10:30:59 +01:00
stop_machine.c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu 2019-10-31 09:33:19 +01:00
sys_ni.c y2038: allow disabling time32 system calls 2019-11-15 14:38:30 +01:00
sys.c prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim 2020-01-28 10:09:51 +01:00
sysctl_binary.c sysctl: Remove the sysctl system call 2019-11-26 13:03:56 -06:00
sysctl-test.c kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec() 2019-09-30 17:35:01 -06:00
sysctl.c kernel: sysctl: make drop_caches write-only 2019-12-01 12:59:07 -08:00
task_work.c
taskstats.c taskstats: fix data-race 2019-12-04 15:18:39 +01:00
test_kprobes.c
torture.c
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user_namespace.c
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
watchdog_hld.c
watchdog.c
workqueue_internal.h
workqueue.c workqueue: Use pr_warn instead of pr_warning 2019-12-06 09:59:30 +01:00