linux_dsm_epyc7002/fs
David Herrmann 40e041a2c8 shm: add sealing API
If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust relationship between both parties, there is no need
for policy enforcement.  However, if there's no trust relationship (e.g.,
for general-purpose IPC), sharing memory regions is highly fragile and
often not possible without local copies.  Look at the following two
use-cases:

  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy anymore. The
     source may have put sensitive data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is disproportionately ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING.  If you seal a file, a
specific set of operations is blocked on that file forever.  Unlike locks,
seals can only be set, never removed.  Hence, once you have verified that a
specific set of seals is set, you're guaranteed that no one can perform the
blocked operations on this file anymore.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This affects ftruncate() and open(O_TRUNC).
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
  - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
          This basically prevents the F_ADD_SEALS operation on a file and
          can be set to prevent others from adding further seals that you
          don't want.
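
As a rough illustration of the SHRINK seal described above, the following
sketch creates a sealable file, adds the seal and then tries to resize it.
It assumes a libc that provides memfd_create() with MFD_ALLOW_SEALING (the
wrapper for the follow-up syscall, which this message does not name) and
the F_SEAL_* constants from <fcntl.h>. The function name shrink_seal_demo()
is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if shrinking a SHRINK-sealed file fails with EPERM while
 * growing it still succeeds, 0 if the behaviour differs, and -1 if the
 * file could not be set up at all.
 */
int shrink_seal_demo(void)
{
    /* Assumption: memfd_create() is how a sealable shmem file is obtained. */
    int fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    int result = -1;

    if (fd < 0)
        return -1;

    if (ftruncate(fd, 4096) == 0 &&
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) == 0) {
        int shrink = ftruncate(fd, 1024);   /* expected to be rejected */
        int shrink_errno = errno;
        int grow = ftruncate(fd, 8192);     /* not covered by SHRINK */

        result = (shrink == -1 && shrink_errno == EPERM && grow == 0);
    }

    close(fd);
    return result;
}
```

Only the size reduction is refused here; GROW and WRITE would have to be
added separately to block the other operations.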

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:

  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
     SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
     process can modify the data while the other side parses it.
     Furthermore, it guarantees that even with writable FDs passed to the
     peer, it cannot increase the size to hit memory-limits of the source
     process (in case the file-storage is accounted to the source).
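
The verification step in both use-cases could look roughly like the sketch
below. The helper names has_seals() and make_sealed_message() are
hypothetical, and memfd_create() with MFD_ALLOW_SEALING is assumed as the
way to obtain a file that supports sealing; none of these are defined by
this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Receiver side: returns 1 only if every seal in `required` is set on fd. */
int has_seals(int fd, int required)
{
    int seals = fcntl(fd, F_GET_SEALS);

    if (seals < 0)          /* file does not support sealing, or bad fd */
        return 0;
    return (seals & required) == required;
}

/*
 * Sender side (use-case 2): copy a message into a new file and seal it
 * against resizing and writing. Returns the fd, or -1 on failure.
 */
int make_sealed_message(const void *data, size_t len)
{
    int fd = memfd_create("message", MFD_CLOEXEC | MFD_ALLOW_SEALING);

    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len ||
        fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A graphics server would call has_seals(fd, F_SEAL_SHRINK) before scanout,
while an IPC peer would require all three seals before parsing inline.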

The new API is an extension to fcntl(), adding two new commands:
  F_GET_SEALS: Return a bitset describing the seals on the file. This
               can be called on any FD if the underlying file supports
               sealing.
  F_ADD_SEALS: Change the seals of a given file. This requires WRITE
               access to the file, and F_SEAL_SEAL must not already be set.
               Furthermore, the underlying file must support sealing and
               there must not be any existing shared mapping of that file.
               Otherwise, EBADF/EPERM is returned.
               The given seals are _added_ to the existing set of seals
               on the file. You cannot remove seals again.
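
The additive behaviour of F_ADD_SEALS and the effect of F_SEAL_SEAL can be
sketched as follows. As before, memfd_create() with MFD_ALLOW_SEALING is an
assumption about how a sealable file is created, and the function name
seal_accumulation_demo() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if seals from two F_ADD_SEALS calls accumulate and a later
 * F_ADD_SEALS is refused with EPERM once F_SEAL_SEAL is set, 0 if the
 * behaviour differs, and -1 if the file could not be set up.
 */
int seal_accumulation_demo(void)
{
    const int expected = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL;
    int fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    int result = -1;

    if (fd < 0)
        return -1;

    /* Two separate calls: the second adds to the first, it does not replace it. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) == 0 &&
        fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SEAL) == 0) {
        int seals = fcntl(fd, F_GET_SEALS);
        int again = fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE);  /* blocked by SEAL */
        int again_errno = errno;

        result = (seals >= 0 &&
                  (seals & expected) == expected &&
                  again == -1 && again_errno == EPERM);
    }

    close(fd);
    return result;
}
```

The check masks the returned bitset instead of comparing it for equality,
so that any additional seal bits a kernel may report do not matter.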

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
..
9p
adfs adfs: add __printf verification, fix format/argument mismatches 2014-08-08 15:57:24 -07:00
affs
afs AFS: Correctly assemble the client UUID 2014-07-29 10:14:36 -07:00
autofs4 autofs4: comment typo: remove a a doubled word 2014-08-08 15:57:19 -07:00
befs fs/befs/linuxvfs.c: check superblock before dump operation 2014-08-08 15:57:20 -07:00
bfs fs/bfs: use bfs prefix for dump_imap 2014-08-08 15:57:24 -07:00
btrfs Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new changes 2014-07-28 10:03:00 +02:00
cachefiles
ceph
cifs sched: Allow wait_on_bit_action() functions to support a timeout 2014-07-16 15:10:41 +02:00
coda fs/coda: use linux/uaccess.h 2014-08-08 15:57:20 -07:00
configfs
cramfs fs/cramfs/inode.c: use linux/uaccess.h 2014-08-08 15:57:25 -07:00
debugfs fs: debugfs: remove trailing whitespace 2014-07-09 16:58:21 -07:00
devpts
dlm fs/dlm/debug_fs.c: remove unnecessary null test before debugfs_remove 2014-08-08 15:57:27 -07:00
ecryptfs
efivarfs
efs fs/efs/namei.c: return is not a function 2014-08-08 15:57:18 -07:00
exofs fs/exofs/ore_raid.c: replace count*size kzalloc by kcalloc 2014-08-08 15:57:24 -07:00
exportfs
ext2
ext3
ext4 ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct 2014-07-30 22:17:17 -04:00
f2fs f2fs: use for_each_set_bit to simplify the code 2014-08-04 13:20:53 -07:00
fat
freevxfs
fscache fs/fscache: make ctl_table static 2014-08-06 18:01:12 -07:00
fuse fuse: add FUSE_NO_OPEN_SUPPORT flag to INIT 2014-07-22 16:37:43 +02:00
gfs2 Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new changes 2014-07-28 10:03:00 +02:00
hfs
hfsplus
hostfs
hpfs fs/hpfs/dnode.c: fix suspect code indent 2014-08-08 15:57:22 -07:00
hppfs
hugetlbfs
isofs initramfs: support initramfs that is bigger than 2GiB 2014-08-08 15:57:26 -07:00
jbd
jbd2 sched: Remove proliferation of wait_on_bit() action functions 2014-07-16 15:10:39 +02:00
jffs2 initramfs: support initramfs that is bigger than 2GiB 2014-08-08 15:57:26 -07:00
jfs
kernfs Merge 3.16-rc6 into driver-core-next 2014-07-21 10:07:25 -07:00
lockd fs: lockd: Use ktime_get_ns() 2014-07-23 15:01:44 -07:00
logfs fs/logfs/readwrite.c: kernel-doc warning fixes 2014-08-06 18:01:12 -07:00
minix minix zmap block counts calculation fix 2014-08-08 15:57:20 -07:00
ncpfs
nfs Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2014-08-06 08:06:39 -07:00
nfs_common
nfsd NFSD: Fix crash encoding lock reply on 32-bit 2014-07-23 10:31:56 -04:00
nilfs2 nilfs2: integrate sysfs support into driver 2014-08-08 15:57:21 -07:00
nls
notify list: fix order of arguments for hlist_add_after(_rcu) 2014-08-06 18:01:24 -07:00
ntfs ntfs: kernel-doc warning fixes 2014-08-06 18:01:12 -07:00
ocfs2 fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc 2014-08-06 18:01:13 -07:00
omfs fs/omfs/inode.c: replace count*size kzalloc by kcalloc 2014-08-08 15:57:25 -07:00
openpromfs
proc sysctl: remove typedef ctl_table 2014-08-08 15:57:24 -07:00
pstore fs/pstore/ram_core.c: replace count*size kmalloc by kmalloc_array 2014-08-08 15:57:25 -07:00
qnx4
qnx6 fs/qnx6: update debugging to current functions 2014-08-08 15:57:26 -07:00
quota quota: missing lock in dqcache_shrink_scan() 2014-07-15 22:36:18 +02:00
ramfs fs/ramfs/file-nommu.c: replace count*size kzalloc by kcalloc 2014-08-08 15:57:18 -07:00
reiserfs fs/reiserfs/xattr.c: fix blank line missing after declarations 2014-08-08 15:57:22 -07:00
romfs fs/romfs/super.c: add blank line after declarations 2014-08-08 15:57:25 -07:00
squashfs fs/squashfs/super.c: logging cleanup 2014-08-06 18:01:13 -07:00
sysfs
sysv
ubifs
udf
ufs fs/ufs/inode.c: kernel-doc warning fixes 2014-08-08 15:57:21 -07:00
xfs xfs: null unused quota inodes when quota is on 2014-07-15 07:28:41 +10:00
aio.c Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-08-04 10:09:27 -07:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
block_dev.c
buffer.c sched: Remove proliferation of wait_on_bit() action functions 2014-07-16 15:10:39 +02:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c Bluetooth: Move HCI socket definitions into its own header file 2014-07-11 13:53:04 +03:00
compat.c
coredump.c coredump: fix the setting of PF_DUMPCORE 2014-07-23 15:10:54 -07:00
dcache.c
dcookies.c
direct-io.c direct-io: fix AIO regression 2014-08-01 02:35:51 -04:00
drop_caches.c
eventfd.c
eventpoll.c
exec.c fork/exec: cleanup mm initialization 2014-08-08 15:57:23 -07:00
fcntl.c shm: add sealing API 2014-08-08 15:57:31 -07:00
fhandle.c
file_table.c
file.c
filesystems.c
fs_struct.c
fs-writeback.c sched: Remove proliferation of wait_on_bit() action functions 2014-07-16 15:10:39 +02:00
inode.c mm: allow drivers to prevent new writable mappings 2014-08-08 15:57:31 -07:00
internal.h
ioctl.c
Kconfig
Kconfig.binfmt
libfs.c
locks.c locks: purge fl_owner_t from fs/locks.c 2014-07-13 21:39:07 -04:00
Makefile
mbcache.c
mount.h
mpage.c
namei.c fs: umount on symlink leaks mnt count 2014-07-24 06:18:12 -04:00
namespace.c list: fix order of arguments for hlist_add_after(_rcu) 2014-08-06 18:01:24 -07:00
no-block.c
open.c vfs: fix check for fallocate on active swapfile 2014-08-01 02:36:04 -04:00
pipe.c
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c
readdir.c
select.c
seq_file.c
signalfd.c
splice.c
stack.c
stat.c
statfs.c
super.c
sync.c
timerfd.c timerfd: Use ktime_mono_to_real() 2014-07-23 10:18:02 -07:00
utimes.c
xattr.c simple_xattr: permit 0-size extended attributes 2014-07-23 15:10:55 -07:00