mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2024-12-28 11:18:45 +07:00
f606b77f1a
During development of checkpoint/restore (c/r) we noticed that supporting
user namespaces runs into a problem with capabilities in the
prctl(PR_SET_MM, ...) call: once a new user namespace is created,
capable(CAP_SYS_RESOURCE) no longer passes.

One approach is to eliminate the CAP_SYS_RESOURCE check but pass all new
values in one bundle, which allows the kernel to run more thorough sanity
checks on the values and at the same time lets us support
checkpoint/restore of user namespaces.

Thus a new command, PR_SET_MM_MAP, is introduced. It takes a pointer to a
prctl_mm_map structure which carries all the members to be updated.

	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

	struct prctl_mm_map {
		__u64	start_code;
		__u64	end_code;
		__u64	start_data;
		__u64	end_data;
		__u64	start_brk;
		__u64	brk;
		__u64	start_stack;
		__u64	arg_start;
		__u64	arg_end;
		__u64	env_start;
		__u64	env_end;
		__u64	*auxv;
		__u32	auxv_size;
		__u32	exe_fd;
	};

All members except @exe_fd correspond to members of struct mm_struct. To
figure out which values these members may take, here are their meanings.

 - start_code, end_code: represent bounds of the executable code area

 - start_data, end_data: represent bounds of the data area

 - start_brk, brk: used to calculate bounds for the brk() syscall

 - start_stack: used when accounting space needed for command line
   arguments, environment and the shmat() syscall

 - arg_start, arg_end, env_start, env_end: represent the memory area
   supplied for command line arguments and environment variables

 - auxv, auxv_size: carries the auxiliary vector, ELF format specifics

 - exe_fd: file descriptor number for the executable link
   (/proc/self/exe)

Thus we apply the following requirements to the values

1) Any member except @auxv, @auxv_size, @exe_fd is an address in user
   space, thus it must lie inside the [mmap_min_addr, mmap_max_addr)
   interval.
2) While @[start|end]_code and @[start|end]_data may point to nonexistent
   VMAs (say a program maps its own new .text and .data segments during
   execution), the rest of the members must belong to an existing VMA.

3) Addresses must be ordered, i.e. a @start_ member must not be greater
   than or equal to the corresponding @end_ member.

4) As in the regular ELF loading procedure, we require that @start_brk and
   @brk be greater than @end_data.

5) If the RLIMIT_DATA rlimit is set to non-infinity, new values must not
   exceed the existing limit. The same applies to RLIMIT_STACK.

6) The auxiliary vector size must not exceed the existing one (which is
   predefined as AT_VECTOR_SIZE and depends on the architecture).

7) The file descriptor passed in @exe_fd must point to an executable file
   (because we use the existing prctl_set_mm_exe_file_locked helper, it
   ensures that the file we are going to use as the exe link has all
   required permissions granted).

Now about where these members are involved inside kernel code:

 - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

 - @start_data and @end_data are used in /proc/$pid/[stat|statm] output;
   they are also considered when checking whether there is enough space
   for the brk() syscall result if RLIMIT_DATA is set;

 - @start_brk is shown in /proc/$pid/stat output and accounted for in the
   brk() syscall if RLIMIT_DATA is set; this member is also tested to
   find a symbolic name of an mmap event for the perf system (we choose
   whether the event is generated for the "heap" area); one more
   application is selinux -- we test whether a process has the
   PROCESS__EXECHEAP permission when it tries to make the heap area
   executable with the mprotect() syscall;

 - @brk is the current value for the brk() syscall, which lies inside the
   heap area; it is shown in /proc/$pid/stat. When the brk() syscall
   successfully provides a new memory area to user space, mm::brk is
   updated upon brk() completion to carry the new value;

   Both @start_brk and @brk are actively used in /proc/$pid/maps and
   /proc/$pid/smaps output to find the symbolic name "heap" for the VMA
   being scanned;

 - @start_stack is printed out in /proc/$pid/stat and used to find the
   symbolic name "stack" for the task and threads in /proc/$pid/maps and
   /proc/$pid/smaps output, and, the same as with @start_brk, the perf
   system uses it for event naming. The kernel also treats this member as
   the start address of where to map vDSO pages and uses it to check
   whether there is enough space for the shmat() syscall;

 - @arg_start, @arg_end, @env_start and @env_end are printed out in
   /proc/$pid/stat. Another way to access the data these members
   represent is to read /proc/$pid/environ or /proc/$pid/cmdline. The
   kernel checks any attempt to read these areas with the
   access_process_vm helper, so a user must have enough rights for this
   action;

 - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
   speaking the kernel doesn't care much about exactly which data is
   sitting there because it is solely for userspace;

 - @exe_fd is referred to from /proc/$pid/exe and when generating a
   coredump. We use the prctl_set_mm_exe_file_locked helper to update
   this member, so exe-file link modification remains a one-shot action.

Still note that updating the exe-file link no longer requires the
sys-resource capability; after all there is not much profit in preventing
a task from setting up its own file link (there are a number of ways to
execute one's own code -- ptrace, ld-preload -- so the only reliable way
to find out exactly which code is executed is to inspect the running
program's memory). Still we require the caller to be at least the
user-namespace root user.

I believe the old interface should be deprecated and ripped out in a
couple of kernel releases if no one is against it.
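
Requirements 3) and 4) above are plain comparisons, so a restorer could run them in user space before calling prctl(). The following is a minimal sketch that follows the wording of this message, not the kernel source; the kernel's own checks may differ in detail. The names struct mm_map_sketch and mm_map_sketch_ok are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical local mirror of the address members listed in the commit
 * message. It is named differently on purpose so that it cannot clash with
 * the real struct prctl_mm_map from <linux/prctl.h>.
 */
struct mm_map_sketch {
	uint64_t start_code, end_code;
	uint64_t start_data, end_data;
	uint64_t start_brk, brk;
	uint64_t start_stack;
	uint64_t arg_start, arg_end;
	uint64_t env_start, env_end;
};

/*
 * Returns true when the map satisfies requirements 3) and 4) as worded in
 * the commit message. Address ranges, VMAs and rlimits are NOT checked here.
 */
static bool mm_map_sketch_ok(const struct mm_map_sketch *m)
{
	assert(m != NULL);

	/* 3) every start_ member must be strictly below its end_ member */
	if (m->start_code >= m->end_code)
		return false;
	if (m->start_data >= m->end_data)
		return false;
	if (m->arg_start >= m->arg_end)
		return false;
	if (m->env_start >= m->env_end)
		return false;

	/* 4) both heap values must sit above the end of the data area */
	if (m->start_brk <= m->end_data || m->brk <= m->end_data)
		return false;

	return true;
}
```

Passing this check does not guarantee that the kernel accepts the map, since requirements 1), 2) and 5) to 7) depend on kernel-side state.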
To test whether the new interface is implemented in the kernel, one can
pass the PR_SET_MM_MAP_SIZE opcode, and the kernel returns the size of the
currently supported struct prctl_mm_map.

[akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Vagin <avagin@openvz.org>
Tested-by: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
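
The PR_SET_MM_MAP_SIZE probe described in the commit message can be sketched from user space as follows. This is an illustration under assumptions: it assumes the kernel writes the size as an unsigned int through the pointer passed as the third prctl() argument, with the remaining arguments zero, and the helper name probe_mm_map_size is hypothetical. The fallback defines repeat the values from the header in case the installed headers predate this command.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values taken from this header. */
#ifndef PR_SET_MM
# define PR_SET_MM 35
#endif
#ifndef PR_SET_MM_MAP_SIZE
# define PR_SET_MM_MAP_SIZE 15
#endif

/*
 * Hypothetical helper: stores the struct prctl_mm_map size reported by the
 * kernel in *out_size and returns 0, or returns -errno when the kernel
 * rejects the request (for example because the interface is not built in).
 * Assumption: the size is written through the third prctl() argument.
 */
static long probe_mm_map_size(unsigned int *out_size)
{
	assert(out_size != NULL);

	*out_size = 0;
	if (prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE,
		  (unsigned long)out_size, 0, 0) != 0)
		return -(long)errno;

	return 0;
}
```

A caller would compare the reported size with sizeof(struct prctl_mm_map) from its own headers and fall back to the per-field PR_SET_MM_* commands when the probe fails.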
183 lines
5.9 KiB
C
#ifndef _LINUX_PRCTL_H
#define _LINUX_PRCTL_H

#include <linux/types.h>

/* Values to pass as first argument to prctl() */

#define PR_SET_PDEATHSIG 1 /* Second arg is a signal */
#define PR_GET_PDEATHSIG 2 /* Second arg is a ptr to return the signal */

/* Get/set current->mm->dumpable */
#define PR_GET_DUMPABLE 3
#define PR_SET_DUMPABLE 4

/* Get/set unaligned access control bits (if meaningful) */
#define PR_GET_UNALIGN 5
#define PR_SET_UNALIGN 6
# define PR_UNALIGN_NOPRINT 1 /* silently fix up unaligned user accesses */
# define PR_UNALIGN_SIGBUS 2 /* generate SIGBUS on unaligned user access */

/* Get/set whether or not to drop capabilities on setuid() away from
 * uid 0 (as per security/commoncap.c) */
#define PR_GET_KEEPCAPS 7
#define PR_SET_KEEPCAPS 8

/* Get/set floating-point emulation control bits (if meaningful) */
#define PR_GET_FPEMU 9
#define PR_SET_FPEMU 10
# define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */
# define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */

/* Get/set floating-point exception mode (if meaningful) */
#define PR_GET_FPEXC 11
#define PR_SET_FPEXC 12
# define PR_FP_EXC_SW_ENABLE 0x80 /* Use FPEXC for FP exception enables */
# define PR_FP_EXC_DIV 0x010000 /* floating point divide by zero */
# define PR_FP_EXC_OVF 0x020000 /* floating point overflow */
# define PR_FP_EXC_UND 0x040000 /* floating point underflow */
# define PR_FP_EXC_RES 0x080000 /* floating point inexact result */
# define PR_FP_EXC_INV 0x100000 /* floating point invalid operation */
# define PR_FP_EXC_DISABLED 0 /* FP exceptions disabled */
# define PR_FP_EXC_NONRECOV 1 /* async non-recoverable exc. mode */
# define PR_FP_EXC_ASYNC 2 /* async recoverable exception mode */
# define PR_FP_EXC_PRECISE 3 /* precise exception mode */

/* Get/set whether we use statistical process timing or accurate timestamp
 * based process timing */
#define PR_GET_TIMING 13
#define PR_SET_TIMING 14
# define PR_TIMING_STATISTICAL 0 /* Normal, traditional,
                                    statistical process timing */
# define PR_TIMING_TIMESTAMP 1 /* Accurate timestamp based
                                  process timing */

#define PR_SET_NAME 15 /* Set process name */
#define PR_GET_NAME 16 /* Get process name */

/* Get/set process endian */
#define PR_GET_ENDIAN 19
#define PR_SET_ENDIAN 20
# define PR_ENDIAN_BIG 0
# define PR_ENDIAN_LITTLE 1 /* True little endian mode */
# define PR_ENDIAN_PPC_LITTLE 2 /* "PowerPC" pseudo little endian */

/* Get/set process seccomp mode */
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22

/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24

/* Get/set the process' ability to use the timestamp counter instruction */
#define PR_GET_TSC 25
#define PR_SET_TSC 26
# define PR_TSC_ENABLE 1 /* allow the use of the timestamp counter */
# define PR_TSC_SIGSEGV 2 /* throw a SIGSEGV instead of reading the TSC */

/* Get/set securebits (as per security/commoncap.c) */
#define PR_GET_SECUREBITS 27
#define PR_SET_SECUREBITS 28

/*
 * Get/set the timerslack as used by poll/select/nanosleep
 * A value of 0 means "use default"
 */
#define PR_SET_TIMERSLACK 29
#define PR_GET_TIMERSLACK 30

#define PR_TASK_PERF_EVENTS_DISABLE 31
#define PR_TASK_PERF_EVENTS_ENABLE 32

/*
 * Set early/late kill mode for hwpoison memory corruption.
 * This influences when the process gets killed on a memory corruption.
 */
#define PR_MCE_KILL 33
# define PR_MCE_KILL_CLEAR 0
# define PR_MCE_KILL_SET 1

# define PR_MCE_KILL_LATE 0
# define PR_MCE_KILL_EARLY 1
# define PR_MCE_KILL_DEFAULT 2

#define PR_MCE_KILL_GET 34

/*
 * Tune up process memory map specifics.
 */
#define PR_SET_MM 35
# define PR_SET_MM_START_CODE 1
# define PR_SET_MM_END_CODE 2
# define PR_SET_MM_START_DATA 3
# define PR_SET_MM_END_DATA 4
# define PR_SET_MM_START_STACK 5
# define PR_SET_MM_START_BRK 6
# define PR_SET_MM_BRK 7
# define PR_SET_MM_ARG_START 8
# define PR_SET_MM_ARG_END 9
# define PR_SET_MM_ENV_START 10
# define PR_SET_MM_ENV_END 11
# define PR_SET_MM_AUXV 12
# define PR_SET_MM_EXE_FILE 13
# define PR_SET_MM_MAP 14
# define PR_SET_MM_MAP_SIZE 15

/*
 * This structure provides a new memory descriptor
 * map which mostly modifies /proc/pid/stat[m]
 * output for a task. This is mostly done for the
 * sake of checkpoint/restore functionality.
 */
struct prctl_mm_map {
	__u64 start_code; /* code section bounds */
	__u64 end_code;
	__u64 start_data; /* data section bounds */
	__u64 end_data;
	__u64 start_brk; /* heap for brk() syscall */
	__u64 brk;
	__u64 start_stack; /* stack starts at */
	__u64 arg_start; /* command line arguments bounds */
	__u64 arg_end;
	__u64 env_start; /* environment variables bounds */
	__u64 env_end;
	__u64 *auxv; /* auxiliary vector */
	__u32 auxv_size; /* vector size */
	__u32 exe_fd; /* /proc/$pid/exe link file */
};

/*
 * Set specific pid that is allowed to ptrace the current task.
 * A value of 0 means "no process".
 */
#define PR_SET_PTRACER 0x59616d61
# define PR_SET_PTRACER_ANY ((unsigned long)-1)

#define PR_SET_CHILD_SUBREAPER 36
#define PR_GET_CHILD_SUBREAPER 37

/*
 * If no_new_privs is set, then operations that grant new privileges (i.e.
 * execve) will either fail or not grant them. This affects suid/sgid,
 * file capabilities, and LSMs.
 *
 * Operations that merely manipulate or drop existing privileges (setresuid,
 * capset, etc.) will still work. Drop those privileges if you want them gone.
 *
 * Changing LSM security domain is considered a new privilege. So, for example,
 * asking selinux for a specific new context (e.g. with runcon) will result
 * in execve returning -EPERM.
 *
 * See Documentation/prctl/no_new_privs.txt for more details.
 */
#define PR_SET_NO_NEW_PRIVS 38
#define PR_GET_NO_NEW_PRIVS 39

#define PR_GET_TID_ADDRESS 40

#define PR_SET_THP_DISABLE 41
#define PR_GET_THP_DISABLE 42

#endif /* _LINUX_PRCTL_H */