2017-11-01 21:08:43 +07:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
|
2012-10-13 16:46:48 +07:00
|
|
|
#ifndef _UAPI_LINUX_SCHED_H
|
|
|
|
#define _UAPI_LINUX_SCHED_H
|
|
|
|
|
fork: add clone3
This adds the clone3 system call.
As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().
We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().
Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.
The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
in clone() into a dedicated struct clone_args
- choose a struct layout that is easy to handle on 32 and on 64 bit
- choose a struct layout that is extensible
- give new flags that currently need to abuse another flag's dedicated
return argument in clone() their own dedicated return argument
(e.g. CLONE_PIDFD)
- use a separate kernel internal struct kernel_clone_args that is
properly typed according to current kernel conventions in fork.c and is
different from the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
syscalls such as fork(), vfork(), clone(), and clone3() behave identical
(Arnd suggested, that we can probably also port do_fork() itself in a
separate patchset.)
- ease of transition for userspace from clone() to clone3()
This very much means that we do *not* remove functionality that userspace
currently relies on as the latter is a good way of creating a syscall
that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible
In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:
/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};
/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};
long sys_clone3(struct clone_args __user *uargs, size_t size)
clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().
There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
It is superseeded by a dedicated "exit_signal" argument in struct
clone_args freeing up space for additional flags.
This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-05-25 16:36:41 +07:00
|
|
|
#include <linux/types.h>
|
|
|
|
|
2012-10-13 16:46:48 +07:00
|
|
|
/*
|
|
|
|
* cloning flags:
|
|
|
|
*/
|
|
|
|
#define CSIGNAL 0x000000ff /* signal mask to be sent at exit */
|
|
|
|
#define CLONE_VM 0x00000100 /* set if VM shared between processes */
|
|
|
|
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
|
|
|
|
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
|
|
|
|
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
|
2019-03-27 19:04:15 +07:00
|
|
|
#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */
|
2012-10-13 16:46:48 +07:00
|
|
|
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
|
|
|
|
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
|
|
|
|
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
|
|
|
|
#define CLONE_THREAD 0x00010000 /* Same thread group? */
|
2014-10-07 16:29:07 +07:00
|
|
|
#define CLONE_NEWNS 0x00020000 /* New mount namespace group */
|
2012-10-13 16:46:48 +07:00
|
|
|
#define CLONE_SYSVSEM 0x00040000 /* share system V SEM_UNDO semantics */
|
|
|
|
#define CLONE_SETTLS 0x00080000 /* create a new TLS for the child */
|
|
|
|
#define CLONE_PARENT_SETTID 0x00100000 /* set the TID in the parent */
|
|
|
|
#define CLONE_CHILD_CLEARTID 0x00200000 /* clear the TID in the child */
|
|
|
|
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
|
|
|
|
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
|
|
|
|
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
|
2016-01-29 15:54:05 +07:00
|
|
|
#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
|
2014-11-04 15:51:22 +07:00
|
|
|
#define CLONE_NEWUTS 0x04000000 /* New utsname namespace */
|
|
|
|
#define CLONE_NEWIPC 0x08000000 /* New ipc namespace */
|
2012-10-13 16:46:48 +07:00
|
|
|
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
|
|
|
|
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
|
|
|
|
#define CLONE_NEWNET 0x40000000 /* New network namespace */
|
|
|
|
#define CLONE_IO 0x80000000 /* Clone io context */
|
|
|
|
|
2019-10-14 17:45:37 +07:00
|
|
|
/* Flags for the clone3() syscall. */
|
|
|
|
#define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
|
2020-02-05 20:26:22 +07:00
|
|
|
#define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
|
2019-10-14 17:45:37 +07:00
|
|
|
|
ns: Introduce Time Namespace
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.
But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.
[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
2019-11-12 08:26:52 +07:00
|
|
|
/*
|
|
|
|
* cloning flags intersect with CSIGNAL so can be used with unshare and clone3
|
|
|
|
* syscalls only:
|
|
|
|
*/
|
|
|
|
#define CLONE_NEWTIME 0x00000080 /* New time namespace */
|
|
|
|
|
2019-09-17 14:18:53 +07:00
|
|
|
#ifndef __ASSEMBLY__
|
2019-09-27 22:29:05 +07:00
|
|
|
/**
|
|
|
|
* struct clone_args - arguments for the clone3 syscall
|
2019-11-15 19:36:20 +07:00
|
|
|
* @flags: Flags for the new process as listed above.
|
|
|
|
* All flags are valid except for CSIGNAL and
|
|
|
|
* CLONE_DETACHED.
|
|
|
|
* @pidfd: If CLONE_PIDFD is set, a pidfd will be
|
|
|
|
* returned in this argument.
|
|
|
|
* @child_tid: If CLONE_CHILD_SETTID is set, the TID of the
|
|
|
|
* child process will be returned in the child's
|
|
|
|
* memory.
|
|
|
|
* @parent_tid: If CLONE_PARENT_SETTID is set, the TID of
|
|
|
|
* the child process will be returned in the
|
|
|
|
* parent's memory.
|
|
|
|
* @exit_signal: The exit_signal the parent process will be
|
|
|
|
* sent when the child exits.
|
|
|
|
* @stack: Specify the location of the stack for the
|
|
|
|
* child process.
|
2019-11-26 09:36:49 +07:00
|
|
|
* Note, @stack is expected to point to the
|
|
|
|
* lowest address. The stack direction will be
|
|
|
|
* determined by the kernel and set up
|
|
|
|
* appropriately based on @stack_size.
|
2019-11-15 19:36:20 +07:00
|
|
|
* @stack_size: The size of the stack for the child process.
|
|
|
|
* @tls: If CLONE_SETTLS is set, the tls descriptor
|
|
|
|
* is set to tls.
|
|
|
|
* @set_tid: Pointer to an array of type *pid_t. The size
|
|
|
|
* of the array is defined using @set_tid_size.
|
|
|
|
* This array is used to select PIDs/TIDs for
|
|
|
|
* newly created processes. The first element in
|
|
|
|
* this defines the PID in the most nested PID
|
|
|
|
* namespace. Each additional element in the array
|
|
|
|
* defines the PID in the parent PID namespace of
|
|
|
|
* the original PID namespace. If the array has
|
|
|
|
* less entries than the number of currently
|
|
|
|
* nested PID namespaces only the PIDs in the
|
|
|
|
* corresponding namespaces are set.
|
|
|
|
* @set_tid_size: This defines the size of the array referenced
|
|
|
|
* in @set_tid. This cannot be larger than the
|
|
|
|
* kernel's limit of nested PID namespaces.
|
2020-02-05 20:26:22 +07:00
|
|
|
* @cgroup: If CLONE_INTO_CGROUP is specified set this to
|
|
|
|
* a file descriptor for the cgroup.
|
2019-09-27 22:29:05 +07:00
|
|
|
*
|
|
|
|
* The structure is versioned by size and thus extensible.
|
|
|
|
* New struct members must go at the end of the struct and
|
|
|
|
* must be properly 64bit aligned.
|
fork: add clone3
This adds the clone3 system call.
As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().
We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().
Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.
The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
in clone() into a dedicated struct clone_args
- choose a struct layout that is easy to handle on 32 and on 64 bit
- choose a struct layout that is extensible
- give new flags that currently need to abuse another flag's dedicated
return argument in clone() their own dedicated return argument
(e.g. CLONE_PIDFD)
- use a separate kernel internal struct kernel_clone_args that is
properly typed according to current kernel conventions in fork.c and is
different from the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
syscalls such as fork(), vfork(), clone(), and clone3() behave identical
(Arnd suggested, that we can probably also port do_fork() itself in a
separate patchset.)
- ease of transition for userspace from clone() to clone3()
This very much means that we do *not* remove functionality that userspace
currently relies on as the latter is a good way of creating a syscall
that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible
In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:
/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};
/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};
long sys_clone3(struct clone_args __user *uargs, size_t size)
clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().
There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
It is superseeded by a dedicated "exit_signal" argument in struct
clone_args freeing up space for additional flags.
This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-05-25 16:36:41 +07:00
|
|
|
*/
|
|
|
|
struct clone_args {
|
|
|
|
__aligned_u64 flags;
|
|
|
|
__aligned_u64 pidfd;
|
|
|
|
__aligned_u64 child_tid;
|
|
|
|
__aligned_u64 parent_tid;
|
|
|
|
__aligned_u64 exit_signal;
|
|
|
|
__aligned_u64 stack;
|
|
|
|
__aligned_u64 stack_size;
|
|
|
|
__aligned_u64 tls;
|
2019-11-15 19:36:20 +07:00
|
|
|
__aligned_u64 set_tid;
|
|
|
|
__aligned_u64 set_tid_size;
|
2020-02-05 20:26:22 +07:00
|
|
|
__aligned_u64 cgroup;
|
fork: add clone3
This adds the clone3 system call.
As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().
We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().
Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.
The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
in clone() into a dedicated struct clone_args
- choose a struct layout that is easy to handle on 32 and on 64 bit
- choose a struct layout that is extensible
- give new flags that currently need to abuse another flag's dedicated
return argument in clone() their own dedicated return argument
(e.g. CLONE_PIDFD)
- use a separate kernel internal struct kernel_clone_args that is
properly typed according to current kernel conventions in fork.c and is
different from the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
syscalls such as fork(), vfork(), clone(), and clone3() behave identical
(Arnd suggested, that we can probably also port do_fork() itself in a
separate patchset.)
- ease of transition for userspace from clone() to clone3()
This very much means that we do *not* remove functionality that userspace
currently relies on as the latter is a good way of creating a syscall
that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible
In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:
/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};
/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};
long sys_clone3(struct clone_args __user *uargs, size_t size)
clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().
There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
It is superseeded by a dedicated "exit_signal" argument in struct
clone_args freeing up space for additional flags.
This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-05-25 16:36:41 +07:00
|
|
|
};
|
2019-09-17 14:18:53 +07:00
|
|
|
#endif
|
fork: add clone3
This adds the clone3 system call.
As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().
We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().
Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.
The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
in clone() into a dedicated struct clone_args
- choose a struct layout that is easy to handle on 32 and on 64 bit
- choose a struct layout that is extensible
- give new flags that currently need to abuse another flag's dedicated
return argument in clone() their own dedicated return argument
(e.g. CLONE_PIDFD)
- use a separate kernel internal struct kernel_clone_args that is
properly typed according to current kernel conventions in fork.c and is
different from the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
syscalls such as fork(), vfork(), clone(), and clone3() behave identical
(Arnd suggested, that we can probably also port do_fork() itself in a
separate patchset.)
- ease of transition for userspace from clone() to clone3()
This very much means that we do *not* remove functionality that userspace
currently relies on as the latter is a good way of creating a syscall
that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible
In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:
/* uapi */
struct clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};
/* kernel internal */
struct kernel_clone_args {
u64 flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
int exit_signal;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
};
long sys_clone3(struct clone_args __user *uargs, size_t size)
clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().
There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
It is superseeded by a dedicated "exit_signal" argument in struct
clone_args freeing up space for additional flags.
This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-05-25 16:36:41 +07:00
|
|
|
|
2019-10-01 08:10:53 +07:00
|
|
|
#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
|
2019-11-15 19:36:20 +07:00
|
|
|
#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
|
2020-02-05 20:26:22 +07:00
|
|
|
#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
|
2019-10-01 08:10:53 +07:00
|
|
|
|
2012-10-13 16:46:48 +07:00
|
|
|
/*
|
|
|
|
* Scheduling policies
|
|
|
|
*/
|
|
|
|
#define SCHED_NORMAL 0
|
|
|
|
#define SCHED_FIFO 1
|
|
|
|
#define SCHED_RR 2
|
|
|
|
#define SCHED_BATCH 3
|
|
|
|
/* SCHED_ISO: reserved but not implemented yet */
|
|
|
|
#define SCHED_IDLE 5
|
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structure of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belong to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-28 17:14:43 +07:00
|
|
|
#define SCHED_DEADLINE 6
|
2014-01-15 23:05:04 +07:00
|
|
|
|
2012-10-13 16:46:48 +07:00
|
|
|
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
|
|
|
|
#define SCHED_RESET_ON_FORK 0x40000000
|
|
|
|
|
2014-01-15 23:05:04 +07:00
|
|
|
/*
|
|
|
|
* For the sched_{set,get}attr() calls
|
|
|
|
*/
|
|
|
|
#define SCHED_FLAG_RESET_ON_FORK 0x01
|
2017-05-19 03:13:33 +07:00
|
|
|
#define SCHED_FLAG_RECLAIM 0x02
|
2017-12-12 18:10:24 +07:00
|
|
|
#define SCHED_FLAG_DL_OVERRUN 0x04
|
2019-06-21 15:42:06 +07:00
|
|
|
#define SCHED_FLAG_KEEP_POLICY 0x08
|
2019-06-21 15:42:07 +07:00
|
|
|
#define SCHED_FLAG_KEEP_PARAMS 0x10
|
|
|
|
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
|
|
|
|
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
|
|
|
|
|
|
|
|
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
|
|
|
|
SCHED_FLAG_KEEP_PARAMS)
|
|
|
|
|
|
|
|
#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
|
|
|
|
SCHED_FLAG_UTIL_CLAMP_MAX)
|
2017-12-12 18:10:24 +07:00
|
|
|
|
|
|
|
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
|
|
|
|
SCHED_FLAG_RECLAIM | \
|
2019-06-21 15:42:06 +07:00
|
|
|
SCHED_FLAG_DL_OVERRUN | \
|
2019-06-21 15:42:07 +07:00
|
|
|
SCHED_FLAG_KEEP_ALL | \
|
|
|
|
SCHED_FLAG_UTIL_CLAMP)
|
2012-10-13 16:46:48 +07:00
|
|
|
|
|
|
|
#endif /* _UAPI_LINUX_SCHED_H */
|