linux_dsm_epyc7002/kernel
Christian Brauner ba7d25f3df
exit: support non-blocking pidfds
Passing a non-blocking pidfd to waitid() currently has no effect, i.e.  is not
supported. There are users which would like to use waitid() on pidfds that are
O_NONBLOCK and mix it with pidfds that are blocking and both pass them to
waitid().
The expected behavior is to have waitid() return -EAGAIN for non-blocking
pidfds and to block for blocking pidfds without needing to perform any
additional checks for flags set on the pidfd before passing it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child process is
ready yet. Returning -EAGAIN for non-blocking pidfds makes it easier for event
loops that handle EAGAIN specially.

It also makes the API more consistent and uniform. In essence, waitid() is
treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets() recvmsg() supports MSG_DONTWAIT
for pidfds waitid() supports WNOHANG. Both flags are per-call options. In
contrast non-blocking pidfds and non-blocking sockets are a setting on an open
file description affecting all threads in the calling process as well as other
processes that hold file descriptors referring to the same open file
description. Both behaviors, per call and per open file description, have
genuine use-cases.

The implementation should be straightforward:
- If a non-blocking pidfd is passed and WNOHANG is not raised we simply raise
  the WNOHANG flag internally. When do_wait() returns indicating that there are
  eligible child processes but none have exited yet we set EAGAIN. If no child
  process exists we continue returning ECHILD.
- If a non-blocking pidfd is passed and WNOHANG is raised waitid() will
  continue returning 0, i.e. it will not set EAGAIN. This ensure backwards
  compatibility with applications passing WNOHANG explicitly with pidfds.

A concrete use-case that was brought on-list was Josh's async pidfd library.
Ever since the introduction of pidfds and more advanced async io various
programming languages such as Rust have grown support for async event
libraries. These libraries are created to help build epoll-based event loops
around file descriptors. A common pattern is to automatically make all file
descriptors they manage to O_NONBLOCK.

For such libraries the EAGAIN error code is treated specially. When a function
is called that returns EAGAIN the function isn't called again until the event
loop indicates the the file descriptor is ready.  Supporting EAGAIN when
waiting on pidfds makes such libraries just work with little effort.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Link: https://github.com/joshtriplett/async-pidfd
Link: https://lore.kernel.org/r/20200902102130.147672-3-christian.brauner@ubuntu.com
2020-09-04 12:31:30 +02:00
..
bpf bpf: Avoid visit same object multiple times 2020-08-18 17:36:23 -07:00
cgroup for-5.9/block-20200802 2020-08-03 11:57:03 -07:00
configs compiler: remove CONFIG_OPTIMIZE_INLINING entirely 2020-04-07 10:43:42 -07:00
debug Remove uninitialized_var() macro for v5.9-rc1 2020-08-04 13:49:43 -07:00
dma dma-mapping fixes for 5.9 2020-08-20 10:48:17 -07:00
entry core/entry: Respect syscall number rewrites 2020-08-21 16:17:29 +02:00
events uprobes: __replace_page() avoid BUG in munlock_vma_page() 2020-08-21 09:52:53 -07:00
gcov treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
irq genirq: Unlock irq descriptor after errors 2020-08-13 09:35:59 +02:00
kcsan Merge branch 'kcsan' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core 2020-08-01 09:26:27 +02:00
livepatch livepatch: Make klp_apply_object_relocs static 2020-05-11 00:31:38 +02:00
locking A set of locking fixes and updates: 2020-08-10 19:07:44 -07:00
power libnvdimm for 5.9 2020-08-11 10:59:19 -07:00
printk Printk changes for 5.9 2020-08-04 22:22:25 -07:00
rcu rcu: kasan: record and print call_rcu() call stack 2020-08-07 11:33:28 -07:00
sched Two fixes: fix a new tracepoint's output value, and fix the formatting of show-state syslog printouts. 2020-08-15 10:36:40 -07:00
time A set oftimekeeping/VDSO updates: 2020-08-14 14:26:08 -07:00
trace Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-08-13 20:03:11 -07:00
.gitignore .gitignore: add SPDX License Identifier 2020-03-25 11:50:48 +01:00
acct.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
async.c treewide: Remove uninitialized_var() usage 2020-07-16 12:35:15 -07:00
audit_fsnotify.c fsnotify: create method handle_inode_event() in fsnotify_operations 2020-07-27 23:25:50 +02:00
audit_tree.c \n 2020-08-06 19:29:51 -07:00
audit_watch.c fsnotify: create method handle_inode_event() in fsnotify_operations 2020-07-27 23:25:50 +02:00
audit.c audit/stable-5.9 PR 20200803 2020-08-04 14:20:26 -07:00
audit.h revert: 1320a4052e ("audit: trigger accompanying records when no rules present") 2020-07-29 10:00:36 -04:00
auditfilter.c audit: fix a net reference leak in audit_list_rules_send() 2020-04-22 15:23:10 -04:00
auditsc.c audit/stable-5.9 PR 20200803 2020-08-04 14:20:26 -07:00
backtracetest.c treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD() 2020-07-30 11:15:58 -07:00
bounds.c
capability.c
compat.c uaccess: Selectively open read or write user access 2020-05-01 12:35:21 +10:00
configs.c proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
context_tracking.c context_tracking: Ensure that the critical path cannot be instrumented 2020-06-11 15:14:36 +02:00
cpu_pm.c kernel/cpu_pm: Fix uninitted local in cpu_pm 2020-05-15 11:44:34 -07:00
cpu.c The changes in this cycle are: 2020-06-03 13:06:42 -07:00
crash_core.c kdump: append kernel build-id string to VMCOREINFO 2020-08-12 10:58:01 -07:00
crash_dump.c crash_dump: Remove no longer used saved_max_pfn 2020-04-15 11:21:54 +02:00
cred.c exec: Teach prepare_exec_creds how exec treats uids & gids 2020-05-20 14:44:21 -05:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c exit: support non-blocking pidfds 2020-09-04 12:31:30 +02:00
extable.c kernel/extable.c: use address-of operator on section symbols 2020-04-07 10:43:42 -07:00
fail_function.c
fork.c A set of locking fixes and updates: 2020-08-10 19:07:44 -07:00
freezer.c
futex.c futex: Convert to use the preferred 'fallthrough' macro 2020-08-13 21:02:12 +02:00
gen_kheaders.sh kbuild: add variables for compression tools 2020-06-06 23:42:01 +09:00
groups.c mm: remove the pgprot argument to __vmalloc 2020-06-02 10:59:11 -07:00
hung_task.c kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected 2020-06-08 11:05:56 -07:00
iomem.c
irq_work.c irq_work, smp: Allow irq_work on call_single_queue 2020-05-28 10:54:15 +02:00
jump_label.c
kallsyms.c Linux 5.8-rc7 2020-07-28 13:18:01 +02:00
kcmp.c kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve 2020-03-25 10:04:01 -05:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks sched/rt, locking: Use CONFIG_PREEMPTION 2019-12-08 14:37:36 +01:00
Kconfig.preempt
kcov.c kcov: make some symbols static 2020-08-12 10:58:02 -07:00
kexec_core.c kexec: add machine_kexec_post_load() 2020-01-08 16:32:55 +00:00
kexec_elf.c
kexec_file.c Misc fixes and small updates all around the place: 2020-08-15 10:38:03 -07:00
kexec_internal.h kexec: add machine_kexec_post_load() 2020-01-08 16:32:55 +00:00
kexec.c kexec: add machine_kexec_post_load() 2020-01-08 16:32:55 +00:00
kheaders.c
kmod.c kmod: remove redundant "be an" in the comment 2020-08-12 10:58:01 -07:00
kprobes.c Tracing updates for 5.9 2020-08-07 18:29:15 -07:00
ksysfs.c
kthread.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
latencytop.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
Makefile all arch: remove system call sys_sysctl 2020-08-14 19:56:56 -07:00
module_signature.c
module_signing.c
module-internal.h
module.c Modules updates for v5.9 2020-08-14 11:07:02 -07:00
notifier.c mm: remove vmalloc_sync_(un)mappings() 2020-06-02 10:59:12 -07:00
nsproxy.c nsproxy: support CLONE_NEWTIME with setns() 2020-07-08 11:14:22 +02:00
padata.c padata: remove padata_parallel_queue 2020-07-23 17:34:18 +10:00
panic.c panic: make print_oops_end_marker() static 2020-08-12 10:58:02 -07:00
params.c
pid_namespace.c pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid 2020-07-19 20:14:42 +02:00
pid.c cap-checkpoint-restore-v5.9 2020-08-04 15:02:07 -07:00
profile.c proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
ptrace.c ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() 2020-01-18 13:51:39 +01:00
range.c
reboot.c arch: remove unicore32 port 2020-07-01 12:09:13 +03:00
regset.c regset: kill ->get() 2020-07-27 14:31:12 -04:00
relay.c kernel/relay.c: fix memleak on destroy relay channel 2020-08-21 09:52:53 -07:00
resource.c /dev/mem: Revoke mappings when a driver claims the region 2020-05-27 11:10:05 +02:00
rseq.c rseq: Reject unknown flags on rseq unregister 2019-12-25 10:41:20 +01:00
scs.c mm: memcontrol: account kernel stack per node 2020-08-07 11:33:25 -07:00
seccomp.c seccomp: Introduce addfd ioctl to seccomp user notifier 2020-07-14 16:29:42 -07:00
signal.c task_work: only grab task signal lock when needed 2020-08-13 09:01:38 -06:00
smp.c smp: Fix a potential usage of stale nr_cpus 2020-07-22 10:22:04 +02:00
smpboot.c
smpboot.h
softirq.c tasklets API update for v5.9-rc1 2020-08-04 13:40:35 -07:00
stackleak.c gcc-plugins/stackleak: Use asm instrumentation to avoid useless register saving 2020-06-24 07:48:28 -07:00
stacktrace.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
stop_machine.c stop_machine: Make stop_cpus() static 2020-01-17 10:19:21 +01:00
sys_ni.c all arch: remove system call sys_sysctl 2020-08-14 19:56:56 -07:00
sys.c prctl: exe link permission error changed from -EINVAL to -EPERM 2020-07-19 20:14:42 +02:00
sysctl-test.c kunit: allow kunit tests to be loaded as a module 2020-01-09 16:42:29 -07:00
sysctl.c mm: use unsigned types for fragmentation score 2020-08-12 10:57:56 -07:00
task_work.c task_work: only grab task signal lock when needed 2020-08-13 09:01:38 -06:00
taskstats.c taskstats: fix data-race 2019-12-04 15:18:39 +01:00
test_kprobes.c
torture.c torture: Dump ftrace at shutdown only if requested 2020-06-29 12:01:45 -07:00
tracepoint.c
tsacct.c tsacct: add 64-bit btime field 2019-12-18 18:07:31 +01:00
ucount.c ucount: Make sure ucounts in /proc/sys/user don't regress again 2020-04-07 21:51:27 +02:00
uid16.c
uid16.h
umh.c kernel: add a kernel_wait helper 2020-08-12 10:57:59 -07:00
up.c smp/up: Make smp_call_function_single() match SMP semantics 2020-02-07 15:34:12 +01:00
user_namespace.c nsproxy: add struct nsset 2020-05-09 13:57:12 +02:00
user-return-notifier.c
user.c user.c: make uidhash_table static 2020-06-04 19:06:24 -07:00
usermode_driver.c umd: Stop using split_argv 2020-07-07 11:58:59 -05:00
utsname_sysctl.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
utsname.c nsproxy: add struct nsset 2020-05-09 13:57:12 +02:00
watch_queue.c watch_queue: Limit the number of watches a user can hold 2020-08-17 09:39:18 -07:00
watchdog_hld.c
watchdog.c kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases 2020-06-08 11:05:56 -07:00
workqueue_internal.h
workqueue.c maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00