linux_dsm_epyc7002/kernel
Christian Brauner ba7d25f3df
exit: support non-blocking pidfds
Passing a non-blocking pidfd to waitid() currently has no effect, i.e.  is not
supported. There are users which would like to use waitid() on pidfds that are
O_NONBLOCK and mix it with pidfds that are blocking and both pass them to
waitid().
The expected behavior is to have waitid() return -EAGAIN for non-blocking
pidfds and to block for blocking pidfds without needing to perform any
additional checks for flags set on the pidfd before passing it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child process is
ready yet. Returning -EAGAIN for non-blocking pidfds makes it easier for event
loops that handle EAGAIN specially.

It also makes the API more consistent and uniform. In essence, waitid() is
treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets() recvmsg() supports MSG_DONTWAIT
for pidfds waitid() supports WNOHANG. Both flags are per-call options. In
contrast non-blocking pidfds and non-blocking sockets are a setting on an open
file description affecting all threads in the calling process as well as other
processes that hold file descriptors referring to the same open file
description. Both behaviors, per call and per open file description, have
genuine use-cases.

The implementation should be straightforward:
- If a non-blocking pidfd is passed and WNOHANG is not raised we simply raise
  the WNOHANG flag internally. When do_wait() returns indicating that there are
  eligible child processes but none have exited yet we set EAGAIN. If no child
  process exists we continue returning ECHILD.
- If a non-blocking pidfd is passed and WNOHANG is raised waitid() will
  continue returning 0, i.e. it will not set EAGAIN. This ensure backwards
  compatibility with applications passing WNOHANG explicitly with pidfds.

A concrete use-case that was brought on-list was Josh's async pidfd library.
Ever since the introduction of pidfds and more advanced async io various
programming languages such as Rust have grown support for async event
libraries. These libraries are created to help build epoll-based event loops
around file descriptors. A common pattern is to automatically make all file
descriptors they manage to O_NONBLOCK.

For such libraries the EAGAIN error code is treated specially. When a function
is called that returns EAGAIN the function isn't called again until the event
loop indicates the the file descriptor is ready.  Supporting EAGAIN when
waiting on pidfds makes such libraries just work with little effort.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Link: https://github.com/joshtriplett/async-pidfd
Link: https://lore.kernel.org/r/20200902102130.147672-3-christian.brauner@ubuntu.com
2020-09-04 12:31:30 +02:00
..
bpf bpf: Avoid visit same object multiple times 2020-08-18 17:36:23 -07:00
cgroup
configs
debug
dma dma-mapping fixes for 5.9 2020-08-20 10:48:17 -07:00
entry core/entry: Respect syscall number rewrites 2020-08-21 16:17:29 +02:00
events uprobes: __replace_page() avoid BUG in munlock_vma_page() 2020-08-21 09:52:53 -07:00
gcov
irq genirq: Unlock irq descriptor after errors 2020-08-13 09:35:59 +02:00
kcsan
livepatch
locking A set of locking fixes and updates: 2020-08-10 19:07:44 -07:00
power libnvdimm for 5.9 2020-08-11 10:59:19 -07:00
printk
rcu rcu: kasan: record and print call_rcu() call stack 2020-08-07 11:33:28 -07:00
sched Two fixes: fix a new tracepoint's output value, and fix the formatting of show-state syslog printouts. 2020-08-15 10:36:40 -07:00
time A set oftimekeeping/VDSO updates: 2020-08-14 14:26:08 -07:00
trace Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-08-13 20:03:11 -07:00
.gitignore
acct.c
async.c
audit_fsnotify.c
audit_tree.c \n 2020-08-06 19:29:51 -07:00
audit_watch.c
audit.c
audit.h
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c
crash_core.c kdump: append kernel build-id string to VMCOREINFO 2020-08-12 10:58:01 -07:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c exit: support non-blocking pidfds 2020-09-04 12:31:30 +02:00
extable.c
fail_function.c
fork.c A set of locking fixes and updates: 2020-08-10 19:07:44 -07:00
freezer.c
futex.c futex: Convert to use the preferred 'fallthrough' macro 2020-08-13 21:02:12 +02:00
gen_kheaders.sh
groups.c
hung_task.c
iomem.c
irq_work.c
jump_label.c jump_label: Don't warn on __exit jump entries 2019-08-29 15:10:10 +01:00
kallsyms.c
kcmp.c
Kconfig.freezer treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: make some symbols static 2020-08-12 10:58:02 -07:00
kexec_core.c
kexec_elf.c
kexec_file.c Misc fixes and small updates all around the place: 2020-08-15 10:38:03 -07:00
kexec_internal.h
kexec.c
kheaders.c
kmod.c kmod: remove redundant "be an" in the comment 2020-08-12 10:58:01 -07:00
kprobes.c Tracing updates for 5.9 2020-08-07 18:29:15 -07:00
ksysfs.c
kthread.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
latencytop.c
Makefile all arch: remove system call sys_sysctl 2020-08-14 19:56:56 -07:00
module_signature.c
module_signing.c
module-internal.h
module.c Modules updates for v5.9 2020-08-14 11:07:02 -07:00
notifier.c
nsproxy.c
padata.c
panic.c panic: make print_oops_end_marker() static 2020-08-12 10:58:02 -07:00
params.c
pid_namespace.c pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid 2020-07-19 20:14:42 +02:00
pid.c
profile.c
ptrace.c
range.c
reboot.c
regset.c regset: kill ->get() 2020-07-27 14:31:12 -04:00
relay.c kernel/relay.c: fix memleak on destroy relay channel 2020-08-21 09:52:53 -07:00
resource.c
rseq.c
scs.c mm: memcontrol: account kernel stack per node 2020-08-07 11:33:25 -07:00
seccomp.c
signal.c task_work: only grab task signal lock when needed 2020-08-13 09:01:38 -06:00
smp.c
smpboot.c
smpboot.h
softirq.c
stackleak.c
stacktrace.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
stop_machine.c
sys_ni.c all arch: remove system call sys_sysctl 2020-08-14 19:56:56 -07:00
sys.c
sysctl-test.c
sysctl.c mm: use unsigned types for fragmentation score 2020-08-12 10:57:56 -07:00
task_work.c task_work: only grab task signal lock when needed 2020-08-13 09:01:38 -06:00
taskstats.c
test_kprobes.c
torture.c
tracepoint.c The main changes in this release include: 2019-07-18 11:51:00 -07:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c kernel: add a kernel_wait helper 2020-08-12 10:57:59 -07:00
up.c
user_namespace.c
user-return-notifier.c
user.c
usermode_driver.c
utsname_sysctl.c
utsname.c
watch_queue.c watch_queue: Limit the number of watches a user can hold 2020-08-17 09:39:18 -07:00
watchdog_hld.c
watchdog.c
workqueue_internal.h
workqueue.c