mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2025-01-15 14:58:02 +07:00
6c32978414
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0 Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12 72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7 GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops= =YTGf -----END PGP SIGNATURE----- Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull notification queue from David Howells: "This adds a general notification queue concept and adds an event source for keys/keyrings, such as linking and unlinking keys and changing their attributes. Thanks to Debarshi Ray, we do have a pull request to use this to fix a problem with gnome-online-accounts - as mentioned last time: https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47 Without this, g-o-a has to constantly poll a keyring-based kerberos cache to find out if kinit has changed anything. [ There are other notification pending: mount/sb fsinfo notifications for libmount that Karel Zak and Ian Kent have been working on, and Christian Brauner would like to use them in lxc, but let's see how this one works first ] LSM hooks are included: - A set of hooks are provided that allow an LSM to rule on whether or not a watch may be set. Each of these hooks takes a different "watched object" parameter, so they're not really shareable. The LSM should use current's credentials. [Wanted by SELinux & Smack] - A hook is provided to allow an LSM to rule on whether or not a particular message may be posted to a particular queue. This is given the credentials from the event generator (which may be the system) and the watch setter. [Wanted by Smack] I've provided SELinux and Smack with implementations of some of these hooks. WHY === Key/keyring notifications are desirable because if you have your kerberos tickets in a file/directory, your Gnome desktop will monitor that using something like fanotify and tell you if your credentials cache changes. However, we also have the ability to cache your kerberos tickets in the session, user or persistent keyring so that it isn't left around on disk across a reboot or logout. Keyrings, however, cannot currently be monitored asynchronously, so the desktop has to poll for it - not so good on a laptop. This facility will allow the desktop to avoid the need to poll. DESIGN DECISIONS ================ - The notification queue is built on top of a standard pipe. Messages are effectively spliced in. The pipe is opened with a special flag: pipe2(fds, O_NOTIFICATION_PIPE); The special flag has the same value as O_EXCL (which doesn't seem like it will ever be applicable in this context)[?]. It is given up front to make it a lot easier to prohibit splice&co from accessing the pipe. [?] Should this be done some other way? I'd rather not use up a new O_* flag if I can avoid it - should I add a pipe3() system call instead? The pipe is then configured:: ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); Messages are then read out of the pipe using read(). - It should be possible to allow write() to insert data into the notification pipes too, but this is currently disabled as the kernel has to be able to insert messages into the pipe *without* holding pipe->mutex and the code to make this work needs careful auditing. - sendfile(), splice() and vmsplice() are disabled on notification pipes because of the pipe->mutex issue and also because they sometimes want to revert what they just did - but one or more notification messages might've been interleaved in the ring. - The kernel inserts messages with the wait queue spinlock held. This means that pipe_read() and pipe_write() have to take the spinlock to update the queue pointers. - Records in the buffer are binary, typed and have a length so that they can be of varying size. This allows multiple heterogeneous sources to share a common buffer; there are 16 million types available, of which I've used just a few, so there is scope for others to be used. Tags may be specified when a watchpoint is created to help distinguish the sources. - Records are filterable as types have up to 256 subtypes that can be individually filtered. Other filtration is also available. - Notification pipes don't interfere with each other; each may be bound to a different set of watches. Any particular notification will be copied to all the queues that are currently watching for it - and only those that are watching for it. - When recording a notification, the kernel will not sleep, but will rather mark a queue as having lost a message if there's insufficient space. read() will fabricate a loss notification message at an appropriate point later. - The notification pipe is created and then watchpoints are attached to it, using one of: keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); watch_mount(AT_FDCWD, "/", 0, fd, 0x02); watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03); where in both cases, fd indicates the queue and the number after is a tag between 0 and 255. - Watches are removed if either the notification pipe is destroyed or the watched object is destroyed. In the latter case, a message will be generated indicating the enforced watch removal. Things I want to avoid: - Introducing features that make the core VFS dependent on the network stack or networking namespaces (ie. usage of netlink). - Dumping all this stuff into dmesg and having a daemon that sits there parsing the output and distributing it as this then puts the responsibility for security into userspace and makes handling namespaces tricky. Further, dmesg might not exist or might be inaccessible inside a container. - Letting users see events they shouldn't be able to see. TESTING AND MANPAGES ==================== - The keyutils tree has a pipe-watch branch that has keyctl commands for making use of notifications. Proposed manual pages can also be found on this branch, though a couple of them really need to go to the main manpages repository instead. If the kernel supports the watching of keys, then running "make test" on that branch will cause the testing infrastructure to spawn a monitoring process on the side that monitors a notifications pipe for all the key/keyring changes induced by the tests and they'll all be checked off to make sure they happened. https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch - A test program is provided (samples/watch_queue/watch_test) that can be used to monitor for keyrings, mount and superblock events. Information on the notifications is simply logged to stdout" * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: smack: Implement the watch_key and post_notification hooks selinux: Implement the watch_key security hook keys: Make the KEY_NEED_* perms an enum rather than a mask pipe: Add notification lossage handling pipe: Allow buffers to be marked read-whole-or-error for notifications Add sample notification program watch_queue: Add a key/keyring notification facility security: Add hooks to rule on setting a watch pipe: Add general notification queue support pipe: Add O_NOTIFICATION_PIPE security: Add a hook for the point of notification insertion uapi: General notification queue definitions
275 lines
8.5 KiB
C
275 lines
8.5 KiB
C
/* SPDX-License-Identifier: GPL-2.0 */
|
|
#ifndef _LINUX_PIPE_FS_I_H
|
|
#define _LINUX_PIPE_FS_I_H
|
|
|
|
#define PIPE_DEF_BUFFERS 16
|
|
|
|
#define PIPE_BUF_FLAG_LRU 0x01 /* page is on the LRU */
|
|
#define PIPE_BUF_FLAG_ATOMIC 0x02 /* was atomically mapped */
|
|
#define PIPE_BUF_FLAG_GIFT 0x04 /* page is a gift */
|
|
#define PIPE_BUF_FLAG_PACKET 0x08 /* read() as a packet */
|
|
#define PIPE_BUF_FLAG_CAN_MERGE 0x10 /* can merge buffers */
|
|
#define PIPE_BUF_FLAG_WHOLE 0x20 /* read() must return entire buffer or error */
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
#define PIPE_BUF_FLAG_LOSS 0x40 /* Message loss happened after this buffer */
|
|
#endif
|
|
|
|
/**
|
|
* struct pipe_buffer - a linux kernel pipe buffer
|
|
* @page: the page containing the data for the pipe buffer
|
|
* @offset: offset of data inside the @page
|
|
* @len: length of data inside the @page
|
|
* @ops: operations associated with this buffer. See @pipe_buf_operations.
|
|
* @flags: pipe buffer flags. See above.
|
|
* @private: private data owned by the ops.
|
|
**/
|
|
struct pipe_buffer {
|
|
struct page *page;
|
|
unsigned int offset, len;
|
|
const struct pipe_buf_operations *ops;
|
|
unsigned int flags;
|
|
unsigned long private;
|
|
};
|
|
|
|
/**
|
|
* struct pipe_inode_info - a linux kernel pipe
|
|
* @mutex: mutex protecting the whole thing
|
|
* @rd_wait: reader wait point in case of empty pipe
|
|
* @wr_wait: writer wait point in case of full pipe
|
|
* @head: The point of buffer production
|
|
* @tail: The point of buffer consumption
|
|
* @note_loss: The next read() should insert a data-lost message
|
|
* @max_usage: The maximum number of slots that may be used in the ring
|
|
* @ring_size: total number of buffers (should be a power of 2)
|
|
* @nr_accounted: The amount this pipe accounts for in user->pipe_bufs
|
|
* @tmp_page: cached released page
|
|
* @readers: number of current readers of this pipe
|
|
* @writers: number of current writers of this pipe
|
|
* @files: number of struct file referring this pipe (protected by ->i_lock)
|
|
* @r_counter: reader counter
|
|
* @w_counter: writer counter
|
|
* @fasync_readers: reader side fasync
|
|
* @fasync_writers: writer side fasync
|
|
* @bufs: the circular array of pipe buffers
|
|
* @user: the user who created this pipe
|
|
* @watch_queue: If this pipe is a watch_queue, this is the stuff for that
|
|
**/
|
|
struct pipe_inode_info {
|
|
struct mutex mutex;
|
|
wait_queue_head_t rd_wait, wr_wait;
|
|
unsigned int head;
|
|
unsigned int tail;
|
|
unsigned int max_usage;
|
|
unsigned int ring_size;
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
bool note_loss;
|
|
#endif
|
|
unsigned int nr_accounted;
|
|
unsigned int readers;
|
|
unsigned int writers;
|
|
unsigned int files;
|
|
unsigned int r_counter;
|
|
unsigned int w_counter;
|
|
struct page *tmp_page;
|
|
struct fasync_struct *fasync_readers;
|
|
struct fasync_struct *fasync_writers;
|
|
struct pipe_buffer *bufs;
|
|
struct user_struct *user;
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
struct watch_queue *watch_queue;
|
|
#endif
|
|
};
|
|
|
|
/*
|
|
* Note on the nesting of these functions:
|
|
*
|
|
* ->confirm()
|
|
* ->try_steal()
|
|
*
|
|
* That is, ->try_steal() must be called on a confirmed buffer. See below for
|
|
* the meaning of each operation. Also see the kerneldoc in fs/pipe.c for the
|
|
* pipe and generic variants of these hooks.
|
|
*/
|
|
struct pipe_buf_operations {
|
|
/*
|
|
* ->confirm() verifies that the data in the pipe buffer is there
|
|
* and that the contents are good. If the pages in the pipe belong
|
|
* to a file system, we may need to wait for IO completion in this
|
|
* hook. Returns 0 for good, or a negative error value in case of
|
|
* error. If not present all pages are considered good.
|
|
*/
|
|
int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *);
|
|
|
|
/*
|
|
* When the contents of this pipe buffer has been completely
|
|
* consumed by a reader, ->release() is called.
|
|
*/
|
|
void (*release)(struct pipe_inode_info *, struct pipe_buffer *);
|
|
|
|
/*
|
|
* Attempt to take ownership of the pipe buffer and its contents.
|
|
* ->try_steal() returns %true for success, in which case the contents
|
|
* of the pipe (the buf->page) is locked and now completely owned by the
|
|
* caller. The page may then be transferred to a different mapping, the
|
|
* most often used case is insertion into different file address space
|
|
* cache.
|
|
*/
|
|
bool (*try_steal)(struct pipe_inode_info *, struct pipe_buffer *);
|
|
|
|
/*
|
|
* Get a reference to the pipe buffer.
|
|
*/
|
|
bool (*get)(struct pipe_inode_info *, struct pipe_buffer *);
|
|
};
|
|
|
|
/**
|
|
* pipe_empty - Return true if the pipe is empty
|
|
* @head: The pipe ring head pointer
|
|
* @tail: The pipe ring tail pointer
|
|
*/
|
|
static inline bool pipe_empty(unsigned int head, unsigned int tail)
|
|
{
|
|
return head == tail;
|
|
}
|
|
|
|
/**
|
|
* pipe_occupancy - Return number of slots used in the pipe
|
|
* @head: The pipe ring head pointer
|
|
* @tail: The pipe ring tail pointer
|
|
*/
|
|
static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
|
|
{
|
|
return head - tail;
|
|
}
|
|
|
|
/**
|
|
* pipe_full - Return true if the pipe is full
|
|
* @head: The pipe ring head pointer
|
|
* @tail: The pipe ring tail pointer
|
|
* @limit: The maximum amount of slots available.
|
|
*/
|
|
static inline bool pipe_full(unsigned int head, unsigned int tail,
|
|
unsigned int limit)
|
|
{
|
|
return pipe_occupancy(head, tail) >= limit;
|
|
}
|
|
|
|
/**
|
|
* pipe_space_for_user - Return number of slots available to userspace
|
|
* @head: The pipe ring head pointer
|
|
* @tail: The pipe ring tail pointer
|
|
* @pipe: The pipe info structure
|
|
*/
|
|
static inline unsigned int pipe_space_for_user(unsigned int head, unsigned int tail,
|
|
struct pipe_inode_info *pipe)
|
|
{
|
|
unsigned int p_occupancy, p_space;
|
|
|
|
p_occupancy = pipe_occupancy(head, tail);
|
|
if (p_occupancy >= pipe->max_usage)
|
|
return 0;
|
|
p_space = pipe->ring_size - p_occupancy;
|
|
if (p_space > pipe->max_usage)
|
|
p_space = pipe->max_usage;
|
|
return p_space;
|
|
}
|
|
|
|
/**
|
|
* pipe_buf_get - get a reference to a pipe_buffer
|
|
* @pipe: the pipe that the buffer belongs to
|
|
* @buf: the buffer to get a reference to
|
|
*
|
|
* Return: %true if the reference was successfully obtained.
|
|
*/
|
|
static inline __must_check bool pipe_buf_get(struct pipe_inode_info *pipe,
|
|
struct pipe_buffer *buf)
|
|
{
|
|
return buf->ops->get(pipe, buf);
|
|
}
|
|
|
|
/**
|
|
* pipe_buf_release - put a reference to a pipe_buffer
|
|
* @pipe: the pipe that the buffer belongs to
|
|
* @buf: the buffer to put a reference to
|
|
*/
|
|
static inline void pipe_buf_release(struct pipe_inode_info *pipe,
|
|
struct pipe_buffer *buf)
|
|
{
|
|
const struct pipe_buf_operations *ops = buf->ops;
|
|
|
|
buf->ops = NULL;
|
|
ops->release(pipe, buf);
|
|
}
|
|
|
|
/**
|
|
* pipe_buf_confirm - verify contents of the pipe buffer
|
|
* @pipe: the pipe that the buffer belongs to
|
|
* @buf: the buffer to confirm
|
|
*/
|
|
static inline int pipe_buf_confirm(struct pipe_inode_info *pipe,
|
|
struct pipe_buffer *buf)
|
|
{
|
|
if (!buf->ops->confirm)
|
|
return 0;
|
|
return buf->ops->confirm(pipe, buf);
|
|
}
|
|
|
|
/**
|
|
* pipe_buf_try_steal - attempt to take ownership of a pipe_buffer
|
|
* @pipe: the pipe that the buffer belongs to
|
|
* @buf: the buffer to attempt to steal
|
|
*/
|
|
static inline bool pipe_buf_try_steal(struct pipe_inode_info *pipe,
|
|
struct pipe_buffer *buf)
|
|
{
|
|
if (!buf->ops->try_steal)
|
|
return false;
|
|
return buf->ops->try_steal(pipe, buf);
|
|
}
|
|
|
|
/* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
|
|
memory allocation, whereas PIPE_BUF makes atomicity guarantees. */
|
|
#define PIPE_SIZE PAGE_SIZE
|
|
|
|
/* Pipe lock and unlock operations */
|
|
void pipe_lock(struct pipe_inode_info *);
|
|
void pipe_unlock(struct pipe_inode_info *);
|
|
void pipe_double_lock(struct pipe_inode_info *, struct pipe_inode_info *);
|
|
|
|
extern unsigned int pipe_max_size;
|
|
extern unsigned long pipe_user_pages_hard;
|
|
extern unsigned long pipe_user_pages_soft;
|
|
|
|
/* Drop the inode semaphore and wait for a pipe event, atomically */
|
|
void pipe_wait(struct pipe_inode_info *pipe);
|
|
|
|
struct pipe_inode_info *alloc_pipe_info(void);
|
|
void free_pipe_info(struct pipe_inode_info *);
|
|
|
|
/* Generic pipe buffer ops functions */
|
|
bool generic_pipe_buf_get(struct pipe_inode_info *, struct pipe_buffer *);
|
|
bool generic_pipe_buf_try_steal(struct pipe_inode_info *, struct pipe_buffer *);
|
|
void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
|
|
|
|
extern const struct pipe_buf_operations nosteal_pipe_buf_ops;
|
|
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
unsigned long account_pipe_buffers(struct user_struct *user,
|
|
unsigned long old, unsigned long new);
|
|
bool too_many_pipe_buffers_soft(unsigned long user_bufs);
|
|
bool too_many_pipe_buffers_hard(unsigned long user_bufs);
|
|
bool pipe_is_unprivileged_user(void);
|
|
#endif
|
|
|
|
/* for F_SETPIPE_SZ and F_GETPIPE_SZ */
|
|
#ifdef CONFIG_WATCH_QUEUE
|
|
int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots);
|
|
#endif
|
|
long pipe_fcntl(struct file *, unsigned int, unsigned long arg);
|
|
struct pipe_inode_info *get_pipe_info(struct file *file, bool for_splice);
|
|
|
|
int create_pipe_files(struct file **, int);
|
|
unsigned int round_pipe_size(unsigned long size);
|
|
|
|
#endif
|