// SPDX-License-Identifier: GPL-2.0
/*
 * kernel userspace event delivery
 *
 * Copyright (C) 2004 Red Hat, Inc. All rights reserved.
 * Copyright (C) 2004 Novell, Inc. All rights reserved.
 * Copyright (C) 2004 IBM, Inc. All rights reserved.
 *
 * Authors:
 *	Robert Love		<rml@novell.com>
 *	Kay Sievers		<kay.sievers@vrfy.org>
 *	Arjan van de Ven	<arjanv@redhat.com>
 *	Greg Kroah-Hartman	<greg@kroah.com>
 */

#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/kobject.h>
#include <linux/export.h>
#include <linux/kmod.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h, making everything defined by the two files
universally available and complicating inclusion dependencies.

The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there, i.e. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and tries to put the new include such that its order conforms
  to its surroundings. It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have a fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions. The script emitted errors for ~400
   files.

2. Each error was manually checked. Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others. This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell. Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   widely available and often used in preprocessor macros. Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
   were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 15:04:11 +07:00
#include <linux/slab.h>
#include <linux/socket.h>
#include <linux/skbuff.h>
#include <linux/netlink.h>
#include <linux/uidgid.h>
kobject: support passing in variables for synthetic uevents

This patch makes it possible to pass additional arguments in addition
to the uevent action name when writing the /sys/.../uevent attribute. These
additional arguments are then inserted into the generated synthetic uevent
as additional environment variables.

Before, we were not able to pass any additional uevent environment
variables for synthetic uevents. This made it hard to identify such uevents
properly in userspace and to distinguish between genuine uevents
originating from the kernel and synthetic uevents triggered from userspace.
Also, it was not possible to pass any additional information which would
make it possible to optimize and change the way the synthetic uevents are
processed back in userspace based on the originating environment of the
triggering action in userspace. With the extra variables, we are
able to pass through this extra information, and it also becomes
possible to synchronize with such synthetic uevents as they can be clearly
identified back in userspace.

The format for writing the uevent attribute is as follows:

    ACTION [UUID [KEY=VALUE ...]]

There's no change in how "ACTION" is recognized - it stays the same
("add", "change", "remove"). The "ACTION" is the only argument required
to generate a synthetic uevent; the rest of the arguments, which this patch
adds support for, are optional.

The "UUID" is considered a transaction identifier, so it's possible to
use the same UUID value for one or more synthetic uevents, in which case
we logically group these uevents together for any userspace listeners.
The "UUID" is expected to be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
format where "x" is a hex digit. The value appears in the uevent as the
"SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment variable.

The "KEY=VALUE" pairs can contain alphanumeric characters only. It's
possible to define zero or more pairs - each pair is then delimited
by a space character " ". Each pair appears in synthetic uevents as a
"SYNTH_ARG_KEY=VALUE" environment variable. That means the KEY name gains
the "SYNTH_ARG_" prefix to avoid possible collisions with existing variables.
To pass the "KEY=VALUE" pairs, it's also required to pass in the "UUID"
part for the synthetic uevent first.

If "UUID" is not passed in, the generated synthetic uevent gains the
"SYNTH_UUID=0" environment variable automatically, so it's possible to
identify this situation in userspace when reading the generated uevent and
we can still tell genuine and synthetic uevents apart.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-09 20:22:30 +07:00
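
As a rough illustration of the "ACTION [UUID [KEY=VALUE ...]]" format described in the commit message above, the following userspace sketch assembles the string that a tool might write to a /sys/.../uevent attribute. The helper name build_synth_uevent and its signature are hypothetical, invented for this example. It only enforces the ordering rule (pairs require a UUID) and does not validate the UUID or the alphanumeric restriction; the kernel does that on its side.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Compose "ACTION [UUID [KEY=VALUE ...]]" into out.
 * Hypothetical helper for illustration; not a kernel or udev API.
 * Returns the string length, or -1 if the arguments are inconsistent
 * or the buffer is too small.
 */
static int build_synth_uevent(char *out, size_t outlen, const char *action,
			      const char *uuid,
			      const char *const *pairs, size_t npairs)
{
	size_t i;
	int n, len;

	if (!action || !*action)
		return -1;
	/* KEY=VALUE pairs are only accepted when a UUID precedes them */
	if (npairs && !uuid)
		return -1;

	len = snprintf(out, outlen, "%s", action);
	if (len < 0 || (size_t)len >= outlen)
		return -1;

	if (uuid) {
		n = snprintf(out + len, outlen - len, " %s", uuid);
		if (n < 0 || (size_t)n >= outlen - len)
			return -1;
		len += n;
	}

	/* each pair is separated from the previous word by one space */
	for (i = 0; i < npairs; i++) {
		n = snprintf(out + len, outlen - len, " %s", pairs[i]);
		if (n < 0 || (size_t)n >= outlen - len)
			return -1;
		len += n;
	}
	return len;
}
```

A caller would typically pass the resulting buffer and length to a plain write() on the opened uevent attribute; that part is left out here because it depends on the device path.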
#include <linux/uuid.h>
#include <linux/ctype.h>
#include <net/sock.h>
#include <net/netlink.h>
#include <net/net_namespace.h>


u64 uevent_seqnum;
#ifdef CONFIG_UEVENT_HELPER
char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
#endif

struct uevent_sock {
	struct list_head list;
	struct sock *sk;
};

#ifdef CONFIG_NET
static LIST_HEAD(uevent_sock_list);
#endif

/* This lock protects uevent_seqnum and uevent_sock_list */
static DEFINE_MUTEX(uevent_sock_mutex);

/* the strings here must match the enum in include/linux/kobject.h */
static const char *kobject_actions[] = {
	[KOBJ_ADD] =		"add",
	[KOBJ_REMOVE] =		"remove",
	[KOBJ_CHANGE] =		"change",
	[KOBJ_MOVE] =		"move",
	[KOBJ_ONLINE] =		"online",
	[KOBJ_OFFLINE] =	"offline",
	[KOBJ_BIND] =		"bind",
	[KOBJ_UNBIND] =		"unbind",
};

static int kobject_action_type(const char *buf, size_t count,
			       enum kobject_action *type,
			       const char **args)
{
	enum kobject_action action;
	size_t count_first;
	const char *args_start;
	int ret = -EINVAL;

	/* ignore a single trailing newline or NUL */
	if (count && (buf[count-1] == '\n' || buf[count-1] == '\0'))
		count--;

	if (!count)
		goto out;

	/* the action name ends at the first space; arguments, if any, follow */
	args_start = strnchr(buf, count, ' ');
	if (args_start) {
		count_first = args_start - buf;
		args_start = args_start + 1;
	} else
		count_first = count;

	for (action = 0; action < ARRAY_SIZE(kobject_actions); action++) {
		/* the first count_first bytes must match the table entry ... */
		if (strncmp(kobject_actions[action], buf, count_first) != 0)
			continue;
		/* ... and the table entry must end there (no prefix matches) */
		if (kobject_actions[action][count_first] != '\0')
			continue;
		if (args)
			*args = args_start;
		*type = action;
		ret = 0;
		break;
	}
out:
	return ret;
}
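
The lookup in kobject_action_type() has to accept "add" but reject both "ad" and "addx". The userspace sketch below shows the same idea in a slightly different form: it compares lengths first and then the bytes, which gives the same result for input without embedded NUL bytes. The names actions and match_action are made up for this example and are not kernel symbols.

```c
#include <stddef.h>
#include <string.h>

/* Same strings as the kernel table, in the same order (assumed here). */
static const char *const actions[] = {
	"add", "remove", "change", "move",
	"online", "offline", "bind", "unbind",
};

/*
 * Return the index of the action spelled by the first len bytes of buf,
 * or -1 if those bytes are not exactly one of the known action names.
 */
static int match_action(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(actions) / sizeof(actions[0]); i++) {
		/* an exact match needs equal length and equal bytes */
		if (strlen(actions[i]) == len &&
		    memcmp(actions[i], buf, len) == 0)
			return (int)i;
	}
	return -1;
}
```

Here len plays the role of count_first: the number of bytes before the first space, or the whole (newline-stripped) input when there are no arguments.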

static const char *action_arg_word_end(const char *buf, const char *buf_end,
				       char delim)
{
	const char *next = buf;

	while (next <= buf_end && *next != delim)
		if (!isalnum(*next++))
			return NULL;

	if (next == buf)
		return NULL;

	return next;
}

static int kobject_action_args(const char *buf, size_t count,
			       struct kobj_uevent_env **ret_env)
{
	struct kobj_uevent_env *env = NULL;
	const char *next, *buf_end, *key;
	int key_len;
	int r = -EINVAL;

	if (count && (buf[count - 1] == '\n' || buf[count - 1] == '\0'))
		count--;

	if (!count)
		return -EINVAL;

	env = kzalloc(sizeof(*env), GFP_KERNEL);
	if (!env)
		return -ENOMEM;

	/* first arg is UUID */
	if (count < UUID_STRING_LEN || !uuid_is_valid(buf) ||
	    add_uevent_var(env, "SYNTH_UUID=%.*s", UUID_STRING_LEN, buf))
		goto out;

	/*
	 * the rest are custom environment variables in KEY=VALUE
	 * format with ' ' delimiter between each KEY=VALUE pair
	 */
	next = buf + UUID_STRING_LEN;
	buf_end = buf + count - 1;

	while (next <= buf_end) {
		if (*next != ' ')
			goto out;

		/* skip the ' ', key must follow */
		key = ++next;
		if (key > buf_end)
			goto out;

		buf = next;
		next = action_arg_word_end(buf, buf_end, '=');
		if (!next || next > buf_end || *next != '=')
			goto out;
		key_len = next - buf;

		/* skip the '=', value must follow */
		if (++next > buf_end)
			goto out;

		buf = next;
		next = action_arg_word_end(buf, buf_end, ' ');
		if (!next)
			goto out;

		if (add_uevent_var(env, "SYNTH_ARG_%.*s=%.*s",
				   key_len, key, (int) (next - buf), buf))
			goto out;
	}

	r = 0;
out:
	if (r)
		kfree(env);
	else
		*ret_env = env;
	return r;
}
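
To make the mapping from the argument string to environment variables concrete, here is a userspace sketch of the same parsing rules: a 36-character UUID first, then zero or more space-delimited alphanumeric KEY=VALUE pairs. It is written from scratch for illustration and is not the kernel code: uuid_ok, word_len, parse_synth_args, UUID_LEN and VAR_MAX are hypothetical names, and the fixed-size output array stands in for struct kobj_uevent_env.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

#define UUID_LEN 36	/* "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" */
#define VAR_MAX  96	/* arbitrary per-variable limit for this sketch */

/* 8-4-4-4-12 hex digits separated by '-' */
static int uuid_ok(const char *s)
{
	size_t i;

	for (i = 0; i < UUID_LEN; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (s[i] != '-')
				return 0;
		} else if (!isxdigit((unsigned char)s[i])) {
			return 0;
		}
	}
	return 1;
}

/* Length of the alphanumeric run starting at p, never reading past end. */
static size_t word_len(const char *p, const char *end)
{
	const char *q = p;

	while (q < end && isalnum((unsigned char)*q))
		q++;
	return (size_t)(q - p);
}

/*
 * Parse "UUID [KEY=VALUE ...]" (count bytes, trailing newline already
 * removed) into vars[]. Returns the number of variables written, or -1
 * on malformed input or when vars[] is too small.
 */
static int parse_synth_args(const char *buf, size_t count,
			    char vars[][VAR_MAX], size_t max_vars)
{
	const char *p, *end = buf + count;
	size_t klen, vlen, n = 0;

	if (count < UUID_LEN || !uuid_ok(buf) || max_vars == 0)
		return -1;
	snprintf(vars[n++], VAR_MAX, "SYNTH_UUID=%.*s", UUID_LEN, buf);

	p = buf + UUID_LEN;
	while (p < end) {
		const char *key, *val;

		if (*p++ != ' ')	/* each pair is introduced by one space */
			return -1;
		key = p;
		klen = word_len(key, end);
		p = key + klen;
		if (klen == 0 || p >= end || *p != '=')
			return -1;
		val = ++p;
		vlen = word_len(val, end);
		p = val + vlen;
		/* value must be non-empty and end at a space or end of input */
		if (vlen == 0 || (p < end && *p != ' '))
			return -1;
		if (n == max_vars)
			return -1;
		if (snprintf(vars[n], VAR_MAX, "SYNTH_ARG_%.*s=%.*s",
			     (int)klen, key, (int)vlen, val) >= VAR_MAX)
			return -1;
		n++;
	}
	return (int)n;
}
```

The sketch uses half-open ranges (end points one past the last byte), whereas the kernel function keeps buf_end on the last valid byte; the accepted inputs are intended to be the same.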

/**
 * kobject_synth_uevent - send synthetic uevent with arguments
 *
 * @kobj: struct kobject for which synthetic uevent is to be generated
 * @buf: buffer containing action type and action args, newline is ignored
 * @count: length of buffer
 *
 * Returns 0 if kobject_synth_uevent() is completed with success or the
 * corresponding error when it fails.
 */
int kobject_synth_uevent(struct kobject *kobj, const char *buf, size_t count)
{
	char *no_uuid_envp[] = { "SYNTH_UUID=0", NULL };
	enum kobject_action action;
	const char *action_args;
	struct kobj_uevent_env *env;
	const char *msg = NULL, *devpath;
	int r;

	r = kobject_action_type(buf, count, &action, &action_args);
	if (r) {
		msg = "unknown uevent action string";
kobject: support passing in variables for synthetic uevents
This patch makes it possible to pass additional arguments in addition
to uevent action name when writing /sys/.../uevent attribute. These
additional arguments are then inserted into generated synthetic uevent
as additional environment variables.
Before, we were not able to pass any additional uevent environment
variables for synthetic uevents. This made it hard to identify such uevents
properly in userspace to make proper distinction between genuine uevents
originating from kernel and synthetic uevents triggered from userspace.
Also, it was not possible to pass any additional information which would
make it possible to optimize and change the way the synthetic uevents are
processed back in userspace based on the originating environment of the
triggering action in userspace. With the extra additional variables, we are
able to pass through this extra information needed and also it makes it
possible to synchronize with such synthetic uevents as they can be clearly
identified back in userspace.
The format for writing the uevent attribute is the following:
ACTION [UUID [KEY=VALUE ...]]
There's no change in how "ACTION" is recognized - it stays the same
("add", "change", "remove"). The "ACTION" is the only argument required
to generate synthetic uevent, the rest of arguments, that this patch
adds support for, are optional.
The "UUID" is considered as transaction identifier so it's possible to
use the same UUID value for one or more synthetic uevents in which case
we logically group these uevents together for any userspace listeners.
The "UUID" is expected to be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
format where "x" is a hex digit. The value appears in uevent as
"SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment variable.
The "KEY=VALUE" pairs can contain alphanumeric characters only. It's
possible to define zero or more more pairs - each pair is then delimited
by a space character " ". Each pair appears in synthetic uevents as
"SYNTH_ARG_KEY=VALUE" environment variable. That means the KEY name gains
"SYNTH_ARG_" prefix to avoid possible collisions with existing variables.
To pass the "KEY=VALUE" pairs, it's also required to pass in the "UUID"
part for the synthetic uevent first.
If "UUID" is not passed in, the generated synthetic uevent gains
"SYNTH_UUID=0" environment variable automatically so it's possible to
identify this situation in userspace when reading generated uevent and so
we can still make a difference between genuine and synthetic uevents.
Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-09 20:22:30 +07:00
		goto out;
	}

	if (!action_args) {
		r = kobject_uevent_env(kobj, action, no_uuid_envp);
		goto out;
	}

	r = kobject_action_args(action_args,
				count - (action_args - buf), &env);
	if (r == -EINVAL) {
		msg = "incorrect uevent action arguments";
		goto out;
	}

	if (r)
		goto out;

	r = kobject_uevent_env(kobj, action, env->envp);
	kfree(env);
out:
	if (r) {
		devpath = kobject_get_path(kobj, GFP_KERNEL);
		pr_warn("synth uevent: %s: %s\n",
		       devpath ?: "unknown device",
		       msg ?: "failed to send uevent");
		kfree(devpath);
	}
	return r;
}
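The `ACTION [UUID [KEY=VALUE ...]]` string that kobject_synth_uevent() accepts has to be assembled by whoever writes to the uevent attribute. The following is a minimal userspace sketch of that assembly, limited to a single optional pair for brevity. The function name `sketch_format_synth_uevent` is hypothetical, and the rule that pairs require a UUID comes from the commit message describing this feature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "ACTION [UUID [KEY=VALUE]]" into out.
 * Returns the string length, or -1 on truncation or inconsistent arguments.
 */
static int sketch_format_synth_uevent(char *out, size_t size,
				      const char *action, const char *uuid,
				      const char *key, const char *value)
{
	int n;

	assert(out != NULL && action != NULL);
	if (!uuid && (key || value))
		return -1;	/* KEY=VALUE pairs require a UUID first */
	if ((key == NULL) != (value == NULL))
		return -1;	/* key and value must come together */

	if (!uuid)
		n = snprintf(out, size, "%s", action);
	else if (!key)
		n = snprintf(out, size, "%s %s", action, uuid);
	else
		n = snprintf(out, size, "%s %s %s=%s", action, uuid, key, value);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

The resulting string would then be written to the device's `uevent` attribute in sysfs. Based on the code above, a string without a UUID should produce `SYNTH_UUID=0` in the emitted event, while a string with a UUID and a pair should produce `SYNTH_UUID=<uuid>` and `SYNTH_ARG_<KEY>=<VALUE>`.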

#ifdef CONFIG_UEVENT_HELPER
static int kobj_usermode_filter(struct kobject *kobj)
{
	const struct kobj_ns_type_operations *ops;

	ops = kobj_ns_ops(kobj);
	if (ops) {
		const void *init_ns, *ns;

		ns = kobj->ktype->namespace(kobj);
		init_ns = ops->initial_ns();
		return ns != init_ns;
	}

	return 0;
}

kobject: don't block for each kobject_uevent
Currently kobject_uevent has somewhat unpredictable semantics. The
point is, since it may call a usermode helper and wait for it to execute
(UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
it will introduce for the caller - strictly speaking it depends on what
fs the binary is located on and the set of locks fork may take. There
are quite a few kobject_uevent's users that do not take this into
account and call it with various mutexes taken, e.g. rtnl_mutex,
net_mutex, which might potentially lead to a deadlock.
Since there is actually no reason to wait for the usermode helper to
execute there, let's make kobject_uevent start the helper asynchronously
with the aid of the UMH_NO_WAIT flag.
Personally, I'm interested in this, because I really want kobject_uevent
to be called under the slab_mutex in the slub implementation as it used
to be some time ago, because it greatly simplifies synchronization and
automatically fixes a kmemcg-related race. However, there was a
deadlock detected on an attempt to call kobject_uevent under the
slab_mutex (see https://lkml.org/lkml/2012/1/14/45), which was reported
to be fixed by releasing the slab_mutex for kobject_uevent.
Unfortunately, there was no information about who exactly blocked on the
slab_mutex causing the usermode helper to stall, neither have I managed
to find this out or reproduce the issue.
BTW, this is not the first attempt to make kobject_uevent use
UMH_NO_WAIT. Previous one was made by commit f520360d93cd ("kobject:
don't block for each kobject_uevent"), but it was wrong (it passed
arguments allocated on stack to async thread) so it was reverted in
05f54c13cd0c ("Revert "kobject: don't block for each kobject_uevent".").
It was aimed at speeding up the boot process, though.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg KH <greg@kroah.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 04:48:21 +07:00
static int init_uevent_argv(struct kobj_uevent_env *env, const char *subsystem)
{
	int len;

	len = strlcpy(&env->buf[env->buflen], subsystem,
		      sizeof(env->buf) - env->buflen);
	if (len >= (sizeof(env->buf) - env->buflen)) {
		WARN(1, KERN_ERR "init_uevent_argv: buffer size too small\n");
		return -ENOMEM;
	}

	env->argv[0] = uevent_helper;
	env->argv[1] = &env->buf[env->buflen];
	env->argv[2] = NULL;

	env->buflen += len + 1;
	return 0;
}
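init_uevent_argv() appends the subsystem name to the shared environment buffer and fails when the copy would be truncated. The same append-with-truncation-check pattern can be sketched in userspace as follows. This is an illustration only: `struct sketch_env`, `SKETCH_BUF_SIZE` and `sketch_env_append` are hypothetical names, and `snprintf` stands in for the kernel's `strlcpy`, which is not available in every C library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_BUF_SIZE 32	/* deliberately small; not the kernel's size */

struct sketch_env {
	char buf[SKETCH_BUF_SIZE];
	size_t buflen;		/* bytes of buf already in use */
};

/*
 * Hypothetical userspace analogue: append str plus its NUL terminator to
 * env->buf. Returns a pointer to the stored copy, or NULL if it does not fit.
 */
static const char *sketch_env_append(struct sketch_env *env, const char *str)
{
	size_t room;
	char *dst;
	int len;

	assert(env != NULL && str != NULL);
	assert(env->buflen <= sizeof(env->buf));
	room = sizeof(env->buf) - env->buflen;
	dst = &env->buf[env->buflen];

	/* snprintf reports the untruncated length, like strlcpy does */
	len = snprintf(dst, room, "%s", str);
	if (len < 0 || (size_t)len >= room)
		return NULL;	/* would be truncated: buflen stays unchanged */

	env->buflen += (size_t)len + 1;	/* account for the terminator too */
	return dst;
}
```

Counting the terminator in `buflen` keeps each stored string independently NUL-terminated, which is what lets the kernel code point `argv[1]` directly into the buffer.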

static void cleanup_uevent_env(struct subprocess_info *info)
{
	kfree(info->data);
}
#endif

#ifdef CONFIG_NET
static struct sk_buff *alloc_uevent_skb(struct kobj_uevent_env *env,
					const char *action_string,
					const char *devpath)
{
	struct netlink_skb_parms *parms;
	struct sk_buff *skb = NULL;
	char *scratch;
	size_t len;

	/* allocate message with maximum possible size */
	len = strlen(action_string) + strlen(devpath) + 2;
	skb = alloc_skb(len + env->buflen, GFP_KERNEL);
	if (!skb)
		return NULL;

	/* add header */
	scratch = skb_put(skb, len);
	sprintf(scratch, "%s@%s", action_string, devpath);

	skb_put_data(skb, env->buf, env->buflen);

	parms = &NETLINK_CB(skb);
	parms->creds.uid = GLOBAL_ROOT_UID;
	parms->creds.gid = GLOBAL_ROOT_GID;
	parms->dst_group = 1;
	parms->portid = 0;

	return skb;
}
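alloc_uevent_skb() lays the message out as an `ACTION@DEVPATH` header, terminated by a NUL byte, followed by the NUL-separated environment block. The following userspace sketch reproduces only that byte layout in a plain buffer. The name `sketch_build_uevent_msg` is hypothetical, and nothing here touches netlink or socket buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "ACTION@DEVPATH\0" followed by the already
 * NUL-separated environment block. Returns the total number of bytes
 * written, or 0 if out is too small.
 */
static size_t sketch_build_uevent_msg(char *out, size_t size,
				      const char *action, const char *devpath,
				      const char *env, size_t env_len)
{
	size_t hdr_len;

	assert(out != NULL && action != NULL && devpath != NULL);
	assert(env != NULL || env_len == 0);

	/* action + '@' + devpath + terminating NUL, as in the kernel code */
	hdr_len = strlen(action) + strlen(devpath) + 2;
	if (hdr_len + env_len > size)
		return 0;

	snprintf(out, hdr_len, "%s@%s", action, devpath);
	if (env_len)
		memcpy(out + hdr_len, env, env_len);
	return hdr_len + env_len;
}
```

A receiver can walk such a message by repeatedly reading a NUL-terminated string and advancing past its terminator until the reported length is consumed.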
netns: restrict uevents
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket.
Currently, only network devices carry namespace tags (i.e. network
namespace tags). Hence, uevents for network devices only show up in the
network namespace such devices are created in or moved to.
However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.
This patch simplifies and fixes a couple of things:
- Split codepath for sending uevents by kobject namespace tags:
1. Untagged kobjects - uevent_net_broadcast_untagged():
Untagged kobjects will be broadcast into all uevent sockets recorded
in uevent_sock_list, i.e. into all network namespaces owned by the
initial user namespace.
2. Tagged kobjects - uevent_net_broadcast_tagged():
Tagged kobjects will only be broadcast into the network namespace they
were tagged with.
Handling of tagged kobjects in 2. does not cause any semantic changes.
This is just splitting out the filtering logic that was handled by
kobj_bcast_filter() before.
Handling of untagged kobjects in 1. will cause a semantic change. The
reasons why this is needed and ok have been discussed in [1]. Here is a
short summary:
- Userspace ignores uevents from network namespaces that are not owned by
the initial user namespace:
Uevents are filtered by userspace in a user namespace because the
received uid != 0. Instead the uid associated with the event will be
65534 == "nobody" because the global root uid is not mapped.
This means we can safely and without introducing regressions modify the
kernel to not send uevents into all network namespaces whose owning
user namespace is not the initial user namespace because we know that
userspace will ignore the message because of the uid anyway.
I have a) verified that this is true for every udev implementation out
there b) that this behavior has been present in all udev
implementations from the very beginning.
- Thundering herd:
Broadcasting uevents into all network namespaces introduces significant
overhead.
All processes that listen to uevents running in non-initial user
namespaces will end up responding to uevents that will be meaningless
to them. Mainly, because non-initial user namespaces cannot easily
manage devices unless they have a privileged host-process helping them
out. This means that there will be a thundering herd of activity when
there shouldn't be any.
- Removing needless overhead/Increasing performance:
Currently, the uevent socket for each network namespace is added to the
global variable uevent_sock_list. The list itself needs to be protected
by a mutex. So every time a uevent is generated the mutex is taken on
the list. The mutex is held *from the creation of the uevent (memory
allocation, string creation etc.) until all uevent sockets have been
handled*. This is aggravated by the fact that for each uevent socket
that has listeners the mc_list must be walked as well which means we're
talking O(n^2) here. Given that a standard Linux workload usually has
quite a lot of network namespaces and - in the face of containers - a
lot of user namespaces this quickly becomes a performance problem (see
"Thundering herd" above). By just recording uevent sockets of network
namespaces that are owned by the initial user namespace we
significantly increase performance in this codepath.
- Injecting uevents:
There's a valid argument that containers might be interested in
receiving device events especially if they are delegated to them by a
privileged userspace process. One prime example are SR-IOV enabled
devices that are explicitly designed to be handed off to other users
such as VMs or containers.
This use-case can now be correctly handled since
commit 692ec06d7c92 ("netns: send uevent messages"). This commit
introduced the ability to send uevents from userspace. As such we can
let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
namespace of the network namespace of the netlink socket) userspace
process make a decision what uevents should be sent. This removes the
need to blindly broadcast uevents into all user namespaces and provides
a performant and safe solution to this problem.
- Filtering logic:
This patch filters by *owning user namespace of the network namespace a
given task resides in* and not by user namespace of the task per se.
This means if the user namespace of a given task is unshared but the
network namespace is kept and is owned by the initial user namespace a
listener that is opening the uevent socket in that network namespace
can still listen to uevents.
- Fix permission for tagged kobjects:
Network devices that are created or moved into a network namespace that
is owned by a non-initial user namespace are currently sent with
INVALID_{G,U}ID in their credentials. This means that all current udev
implementations in userspace will ignore the uevent they receive for
them. This has led to weird bugs whereby new devices showing up in such
network namespaces were not recognized and did not get IPs assigned etc.
This patch adjusts the permission to the appropriate {g,u}id in the
respective user namespace. This way udevd is able to correctly handle
such devices.
- Simplify filtering logic:
do_one_broadcast() already ensures that only listeners in mc_list receive
uevents that have the same network namespace as the uevent socket itself.
So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
patch therefore removes kobj_bcast_filter() and replaces
netlink_broadcast_filtered() with the simpler netlink_broadcast()
everywhere.
[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29 17:44:12 +07:00
static int uevent_net_broadcast_untagged(struct kobj_uevent_env *env,
					 const char *action_string,
					 const char *devpath)
{
	struct sk_buff *skb = NULL;
	struct uevent_sock *ue_sk;
	int retval = 0;

	/* send netlink message */
	list_for_each_entry(ue_sk, &uevent_sock_list, list) {
		struct sock *uevent_sock = ue_sk->sk;

		if (!netlink_has_listeners(uevent_sock, 1))
			continue;

		if (!skb) {
			retval = -ENOMEM;
			skb = alloc_uevent_skb(env, action_string, devpath);
			if (!skb)
				continue;
		}

netns: restrict uevents
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket.
Currently, only network devices carry namespace tags (i.e. network
namespace tags). Hence, uevents for network devices only show up in the
network namespace such devices are created in or moved to.
However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.
This patch simplifies and fixes couple of things:
- Split codepath for sending uevents by kobject namespace tags:
1. Untagged kobjects - uevent_net_broadcast_untagged():
Untagged kobjects will be broadcast into all uevent sockets recorded
in uevent_sock_list, i.e. into all network namespacs owned by the
intial user namespace.
2. Tagged kobjects - uevent_net_broadcast_tagged():
Tagged kobjects will only be broadcast into the network namespace they
were tagged with.
Handling of tagged kobjects in 2. does not cause any semantic changes.
This is just splitting out the filtering logic that was handled by
kobj_bcast_filter() before.
Handling of untagged kobjects in 1. will cause a semantic change. The
reasons why this is needed and ok have been discussed in [1]. Here is a
short summary:
- Userspace ignores uevents from network namespaces that are not owned by
the intial user namespace:
Uevents are filtered by userspace in a user namespace because the
received uid != 0. Instead the uid associated with the event will be
65534 == "nobody" because the global root uid is not mapped.
This means we can safely and without introducing regressions modify the
kernel to not send uevents into all network namespaces whose owning
user namespace is not the initial user namespace because we know that
userspace will ignore the message because of the uid anyway.
I have a) verified that is is true for every udev implementation out
there b) that this behavior has been present in all udev
implementations from the very beginning.
- Thundering herd:
Broadcasting uevents into all network namespaces introduces significant
overhead.
All processes that listen to uevents running in non-initial user
namespaces will end up responding to uevents that will be meaningless
to them. Mainly, because non-initial user namespaces cannot easily
manage devices unless they have a privileged host-process helping them
out. This means that there will be a thundering herd of activity when
there shouldn't be any.
- Removing needless overhead/Increasing performance:
Currently, the uevent socket for each network namespace is added to the
global variable uevent_sock_list. The list itself needs to be protected
by a mutex. So every time a uevent is generated the mutex is taken on
the list. The mutex is held *from the creation of the uevent (memory
allocation, string creation etc.) until all uevent sockets have been
handled*. This is aggravated by the fact that for each uevent socket
that has listeners the mc_list must be walked as well which means we're
talking O(n^2) here. Given that a standard Linux workload usually has
quite a lot of network namespaces and - in the face of containers - a
lot of user namespaces this quickly becomes a performance problem (see
"Thundering herd" above). By just recording uevent sockets of network
namespaces that are owned by the initial user namespace we
significantly increase performance in this codepath.
- Injecting uevents:
There's a valid argument that containers might be interested in
receiving device events especially if they are delegated to them by a
privileged userspace process. One prime example is SR-IOV enabled
devices that are explicitly designed to be handed off to other users
such as VMs or containers.
This use-case can now be correctly handled since
commit 692ec06d7c92 ("netns: send uevent messages"). This commit
introduced the ability to send uevents from userspace. As such we can
let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
namespace of the network namespace of the netlink socket) userspace
process make a decision what uevents should be sent. This removes the
need to blindly broadcast uevents into all user namespaces and provides
a performant and safe solution to this problem.
- Filtering logic:
This patch filters by *owning user namespace of the network namespace a
given task resides in* and not by user namespace of the task per se.
This means if the user namespace of a given task is unshared but the
network namespace is kept and is owned by the initial user namespace a
listener that is opening the uevent socket in that network namespace
can still listen to uevents.
- Fix permission for tagged kobjects:
Network devices that are created or moved into a network namespace that
is owned by a non-initial user namespace are currently sent with
INVALID_{G,U}ID in their credentials. This means that all current udev
implementations in userspace will ignore the uevent they receive for
them. This has led to weird bugs whereby new devices showing up in such
network namespaces were not recognized and did not get IPs assigned etc.
This patch adjusts the permission to the appropriate {g,u}id in the
respective user namespace. This way udevd is able to correctly handle
such devices.
- Simplify filtering logic:
do_one_broadcast() already ensures that only listeners in mc_list receive
uevents that have the same network namespace as the uevent socket itself.
So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
patch therefore removes kobj_bcast_filter() and replaces
netlink_broadcast_filtered() with the simpler netlink_broadcast()
everywhere.
[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29 17:44:12 +07:00
		retval = netlink_broadcast(uevent_sock, skb_get(skb), 0, 1,
					   GFP_KERNEL);
2017-09-20 06:27:05 +07:00
		/* ENOBUFS should be handled in userspace */
		if (retval == -ENOBUFS || retval == -ESRCH)
			retval = 0;
2017-09-20 06:27:03 +07:00
	}
2017-09-20 06:27:05 +07:00
	consume_skb(skb);
2018-04-29 17:44:12 +07:00

2017-09-20 06:27:03 +07:00
	return retval;
}

2018-04-29 17:44:12 +07:00
static int uevent_net_broadcast_tagged(struct sock *usk,
				       struct kobj_uevent_env *env,
				       const char *action_string,
				       const char *devpath)
{
	struct user_namespace *owning_user_ns = sock_net(usk)->user_ns;
	struct sk_buff *skb = NULL;
	int ret = 0;

	skb = alloc_uevent_skb(env, action_string, devpath);
	if (!skb)
		return -ENOMEM;

	/* fix credentials */
	if (owning_user_ns != &init_user_ns) {
		struct netlink_skb_parms *parms = &NETLINK_CB(skb);
		kuid_t root_uid;
		kgid_t root_gid;

		/* fix uid */
		root_uid = make_kuid(owning_user_ns, 0);
		if (uid_valid(root_uid))
			parms->creds.uid = root_uid;

		/* fix gid */
		root_gid = make_kgid(owning_user_ns, 0);
		if (gid_valid(root_gid))
			parms->creds.gid = root_gid;
	}

	ret = netlink_broadcast(usk, skb, 0, 1, GFP_KERNEL);
	/* ENOBUFS should be handled in userspace */
	if (ret == -ENOBUFS || ret == -ESRCH)
		ret = 0;

	return ret;
}
#endif

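The "fix credentials" step in uevent_net_broadcast_tagged() depends on make_kuid() translating uid 0 of the owning user namespace into a kernel-wide id, and on uid_valid() rejecting the result when that namespace has no mapping for root. The userspace sketch below models only that idea with a flat extent table. Every name in it (`struct id_extent`, `model_make_kuid()`, `model_fix_cred()`, `MODEL_INVALID_ID`) is hypothetical and not a kernel API, and the real kernel lookup is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an invalid id, modelled as (uint32_t)-1. */
#define MODEL_INVALID_ID ((uint32_t)-1)

/*
 * One mapping extent (assumed layout): ids [first, first + count) inside
 * the namespace correspond to [lower_first, lower_first + count) outside.
 */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Model of make_kuid(): translate a namespace-local id to a global id. */
static uint32_t model_make_kuid(const struct id_extent *map, size_t n,
				uint32_t id)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return map[i].lower_first + (id - map[i].first);
	}
	return MODEL_INVALID_ID;	/* no extent covers this id */
}

/*
 * Model of the "fix uid" step: replace the credential with the global id
 * of the namespace's root only when that mapping exists; otherwise keep
 * whatever credential the message already carried.
 */
static uint32_t model_fix_cred(const struct id_extent *map, size_t n,
			       uint32_t current_cred)
{
	uint32_t root = model_make_kuid(map, n, 0);

	return root != MODEL_INVALID_ID ? root : current_cred;
}
```

With a table that maps namespace ids 0..65535 to 100000 upward, `model_fix_cred()` yields 100000, which a listener inside that namespace would see as uid 0. With a table that does not cover id 0, the credential is left unchanged, mirroring the uid_valid() guard in the listing above.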
static int kobject_uevent_net_broadcast(struct kobject *kobj,
					struct kobj_uevent_env *env,
					const char *action_string,
					const char *devpath)
{
	int ret = 0;

#ifdef CONFIG_NET
	const struct kobj_ns_type_operations *ops;
	const struct net *net = NULL;

	ops = kobj_ns_ops(kobj);
	if (!ops && kobj->kset) {
		struct kobject *ksobj = &kobj->kset->kobj;
2018-10-30 19:01:15 +07:00

2018-04-29 17:44:12 +07:00
		if (ksobj->parent != NULL)
			ops = kobj_ns_ops(ksobj->parent);
	}

	/* kobjects currently only carry network namespace tags and they
	 * are the only tag relevant here since we want to decide which
	 * network namespaces to broadcast the uevent into.
	 */
	if (ops && ops->netlink_ns && kobj->ktype->namespace)
		if (ops->type == KOBJ_NS_TYPE_NET)
			net = kobj->ktype->namespace(kobj);

	if (!net)
		ret = uevent_net_broadcast_untagged(env, action_string,
						    devpath);
	else
		ret = uevent_net_broadcast_tagged(net->uevent_sock->sk, env,
						  action_string, devpath);
#endif

	return ret;
}

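The dispatch in kobject_uevent_net_broadcast() can be summarised as a choice of receivers: a tagged kobject reaches only the network namespace it is tagged with, while an untagged kobject reaches the namespaces recorded in uevent_sock_list, which the commit message describes as those owned by the initial user namespace. The sketch below is a userspace model of that rule only. `struct model_netns` and `model_uevent_targets()` are hypothetical names, and no sockets or netlink calls are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a network namespace. */
struct model_netns {
	int id;
	bool owned_by_init_userns;	/* owner is the initial user namespace */
};

/*
 * Collect the namespaces that would receive a uevent.
 * tag == NULL models an untagged kobject; otherwise tag is the namespace
 * the kobject is tagged with. The out array must have room for at least
 * n pointers and never fewer than one. Returns the number of receivers.
 */
static size_t model_uevent_targets(const struct model_netns *all, size_t n,
				   const struct model_netns *tag,
				   const struct model_netns **out)
{
	size_t count = 0;

	assert(out != NULL);

	if (tag) {
		/* Tagged: exactly one receiver, regardless of its owner. */
		out[count++] = tag;
		return count;
	}

	/* Untagged: every namespace owned by the initial user namespace. */
	for (size_t i = 0; i < n; i++) {
		if (all[i].owned_by_init_userns)
			out[count++] = &all[i];
	}
	return count;
}
```

Given three namespaces of which the second belongs to a container's user namespace, an untagged event selects the first and third, and an event tagged with the second selects only the second.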
2017-09-14 06:29:48 +07:00
static void zap_modalias_env(struct kobj_uevent_env *env)
{
	static const char modalias_prefix[] = "MODALIAS=";
2017-12-14 06:21:22 +07:00
	size_t len;
	int i, j;
2017-09-14 06:29:48 +07:00

	for (i = 0; i < env->envp_idx;) {
		if (strncmp(env->envp[i], modalias_prefix,
			    sizeof(modalias_prefix) - 1)) {
			i++;
			continue;
		}

2017-12-14 06:21:22 +07:00
		len = strlen(env->envp[i]) + 1;

		if (i != env->envp_idx - 1) {
			memmove(env->envp[i], env->envp[i + 1],
				env->buflen - len);

			for (j = i; j < env->envp_idx - 1; j++)
				env->envp[j] = env->envp[j + 1] - len;
		}
2017-09-14 06:29:48 +07:00

		env->envp_idx--;
2017-12-14 06:21:22 +07:00
		env->buflen -= len;
2017-09-14 06:29:48 +07:00
	}
}

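zap_modalias_env() removes an entry from a packed buffer of NUL-terminated "KEY=value" strings and then rebases the pointers that follow it. The userspace sketch below reproduces that compaction on a small stand-in structure. `struct model_env`, `model_add_var()`, and `model_zap_prefix()` are hypothetical names, and the buffer sizes are arbitrary. One deliberate difference from the listing above: the sketch computes the number of bytes to move from the position of the next entry, so the move never reaches past the used part of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_ENVP_MAX 8
#define MODEL_BUF_SIZE 256

/* Simplified, hypothetical analogue of struct kobj_uevent_env. */
struct model_env {
	char *envp[MODEL_ENVP_MAX];	/* pointers into buf */
	int envp_idx;			/* number of entries in envp */
	char buf[MODEL_BUF_SIZE];	/* packed "KEY=value\0" strings */
	int buflen;			/* bytes of buf in use */
};

/* Append one "KEY=value" string; returns 0 on success, -1 if it does not fit. */
static int model_add_var(struct model_env *env, const char *kv)
{
	size_t len = strlen(kv) + 1;

	if (env->envp_idx >= MODEL_ENVP_MAX ||
	    len > (size_t)(MODEL_BUF_SIZE - env->buflen))
		return -1;
	memcpy(env->buf + env->buflen, kv, len);
	env->envp[env->envp_idx++] = env->buf + env->buflen;
	env->buflen += (int)len;
	return 0;
}

/* Remove every entry that starts with prefix, closing the gap in buf. */
static void model_zap_prefix(struct model_env *env, const char *prefix)
{
	size_t plen = strlen(prefix);
	int i = 0;

	while (i < env->envp_idx) {
		if (strncmp(env->envp[i], prefix, plen) != 0) {
			i++;
			continue;
		}

		size_t len = strlen(env->envp[i]) + 1;

		if (i != env->envp_idx - 1) {
			char *next = env->envp[i + 1];
			/* Bytes from the next entry to the end of used data. */
			size_t tail = (size_t)(env->buf + env->buflen - next);

			memmove(env->envp[i], next, tail);
			/* Later entries moved down by len bytes. */
			for (int j = i; j < env->envp_idx - 1; j++)
				env->envp[j] = env->envp[j + 1] - len;
		}

		env->envp_idx--;
		env->buflen -= (int)len;
		/* Do not advance i: the slot now holds the following entry. */
	}
}
```

After adding "ACTION=unbind", a "MODALIAS=..." entry, and "SEQNUM=7", calling `model_zap_prefix(&env, "MODALIAS=")` leaves two entries whose strings sit back to back in the buffer.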
2005-04-17 05:20:36 +07:00
/**
2006-11-20 23:07:51 +07:00
 * kobject_uevent_env - send a uevent with environmental data
2005-04-17 05:20:36 +07:00
 *
 * @kobj: struct kobject that the action is happening to
2016-10-02 02:46:28 +07:00
 * @action: action that is happening
2006-11-20 23:07:51 +07:00
 * @envp_ext: pointer to environmental data
2006-12-20 04:01:27 +07:00
 *
2010-08-13 17:58:10 +07:00
 * Returns 0 if kobject_uevent_env() is completed with success or the
2006-12-20 04:01:27 +07:00
 * corresponding error when it fails.
2005-04-17 05:20:36 +07:00
 */
2006-12-20 04:01:27 +07:00
int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
2007-08-14 20:15:12 +07:00
		       char *envp_ext[])
2005-04-17 05:20:36 +07:00
{
2007-08-14 20:15:12 +07:00
	struct kobj_uevent_env *env;
	const char *action_string = kobject_actions[action];
2005-11-11 20:43:07 +07:00
	const char *devpath = NULL;
	const char *subsystem;
	struct kobject *top_kobj;
	struct kset *kset;
2009-12-31 20:52:51 +07:00
	const struct kset_uevent_ops *uevent_ops;
2005-04-17 05:20:36 +07:00
	int i = 0;
2006-12-20 04:01:27 +07:00
	int retval = 0;
2005-04-17 05:20:36 +07:00

2007-11-29 14:49:41 +07:00
	pr_debug("kobject: '%s' (%p): %s\n",
2008-04-30 14:55:08 +07:00
		 kobject_name(kobj), kobj, __func__);
2005-11-11 20:43:07 +07:00

	/* search the kset we belong to */
	top_kobj = kobj;
2007-08-13 01:43:55 +07:00
	while (!top_kobj->kset && top_kobj->parent)
2007-04-04 18:39:17 +07:00
		top_kobj = top_kobj->parent;
2007-08-13 01:43:55 +07:00

2006-12-20 04:01:27 +07:00
	if (!top_kobj->kset) {
2007-11-29 14:49:41 +07:00
		pr_debug("kobject: '%s' (%p): %s: attempted to send uevent "
			 "without kset!\n", kobject_name(kobj), kobj,
2008-04-30 14:55:08 +07:00
			 __func__);
2006-12-20 04:01:27 +07:00
		return -EINVAL;
	}
2005-04-17 05:20:36 +07:00

2005-11-11 20:43:07 +07:00
	kset = top_kobj->kset;
2005-11-16 15:00:00 +07:00
	uevent_ops = kset->uevent_ops;
2005-04-17 05:20:36 +07:00

2009-03-01 20:10:49 +07:00
	/* skip the event, if uevent_suppress is set */
	if (kobj->uevent_suppress) {
		pr_debug("kobject: '%s' (%p): %s: uevent_suppress "
			 "caused the event to drop!\n",
			 kobject_name(kobj), kobj, __func__);
		return 0;
	}
2007-08-14 20:15:12 +07:00
	/* skip the event, if the filter returns zero. */
2005-11-16 15:00:00 +07:00
	if (uevent_ops && uevent_ops->filter)
2006-12-20 04:01:27 +07:00
		if (!uevent_ops->filter(kset, kobj)) {
2007-11-29 14:49:41 +07:00
			pr_debug("kobject: '%s' (%p): %s: filter function "
				 "caused the event to drop!\n",
2008-04-30 14:55:08 +07:00
				 kobject_name(kobj), kobj, __func__);
2006-12-20 04:01:27 +07:00
			return 0;
		}
2005-04-17 05:20:36 +07:00

2007-03-14 09:25:56 +07:00
	/* originating subsystem */
	if (uevent_ops && uevent_ops->name)
		subsystem = uevent_ops->name(kset, kobj);
	else
		subsystem = kobject_name(&kset->kobj);
	if (!subsystem) {
2007-11-29 14:49:41 +07:00
		pr_debug("kobject: '%s' (%p): %s: unset subsystem caused the "
			 "event to drop!\n", kobject_name(kobj), kobj,
2008-04-30 14:55:08 +07:00
			 __func__);
2007-03-14 09:25:56 +07:00
		return 0;
	}

2007-08-14 20:15:12 +07:00
	/* environment buffer */
	env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
	if (!env)
2006-12-20 04:01:27 +07:00
		return -ENOMEM;
2005-04-17 05:20:36 +07:00

2005-11-11 20:43:07 +07:00
	/* complete object path */
	devpath = kobject_get_path(kobj, GFP_KERNEL);
2006-12-20 04:01:27 +07:00
	if (!devpath) {
		retval = -ENOENT;
2005-11-11 20:43:07 +07:00
		goto exit;
2006-12-20 04:01:27 +07:00
	}
2005-04-17 05:20:36 +07:00

2005-11-11 20:43:07 +07:00
	/* default keys */
2007-08-14 20:15:12 +07:00
	retval = add_uevent_var(env, "ACTION=%s", action_string);
	if (retval)
		goto exit;
	retval = add_uevent_var(env, "DEVPATH=%s", devpath);
	if (retval)
		goto exit;
	retval = add_uevent_var(env, "SUBSYSTEM=%s", subsystem);
	if (retval)
		goto exit;

	/* keys passed in from the caller */
	if (envp_ext) {
		for (i = 0; envp_ext[i]; i++) {
2008-11-13 11:20:00 +07:00
			retval = add_uevent_var(env, "%s", envp_ext[i]);
2007-08-14 20:15:12 +07:00
			if (retval)
				goto exit;
		}
	}
2005-04-17 05:20:36 +07:00

2005-11-11 20:43:07 +07:00
	/* let the kset specific function add its stuff */
2005-11-16 15:00:00 +07:00
	if (uevent_ops && uevent_ops->uevent) {
2007-08-14 20:15:12 +07:00
		retval = uevent_ops->uevent(kset, kobj, env);
2005-04-17 05:20:36 +07:00
		if (retval) {
2007-11-29 14:49:41 +07:00
			pr_debug("kobject: '%s' (%p): %s: uevent() returned "
				 "%d\n", kobject_name(kobj), kobj,
2008-04-30 14:55:08 +07:00
				 __func__, retval);
2005-04-17 05:20:36 +07:00
			goto exit;
		}
	}

2017-09-14 06:29:48 +07:00
	switch (action) {
|
|
|
|
case KOBJ_ADD:
|
|
|
|
/*
|
|
|
|
* Mark "add" event so we can make sure we deliver "remove"
|
|
|
|
* event to userspace during automatic cleanup. If
|
|
|
|
* the object did send an "add" event, "remove" will
|
|
|
|
* be automatically generated by the core, if not already done
|
|
|
|
* by the caller.
|
|
|
|
*/
|
2007-12-19 07:40:42 +07:00
|
|
|
kobj->state_add_uevent_sent = 1;
|
2017-09-14 06:29:48 +07:00
|
|
|
break;
|
|
|
|
|
|
|
|
case KOBJ_REMOVE:
|
2007-12-19 07:40:42 +07:00
|
|
|
kobj->state_remove_uevent_sent = 1;
|
2017-09-14 06:29:48 +07:00
|
|
|
break;
|
|
|
|
|
|
|
|
case KOBJ_UNBIND:
|
|
|
|
zap_modalias_env(env);
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
2007-12-19 07:40:42 +07:00
|
|
|
|
2012-03-07 17:49:56 +07:00
|
|
|
mutex_lock(&uevent_sock_mutex);
|
2007-08-14 20:15:12 +07:00
|
|
|
/* we will send an event, so request a new sequence number */
|
2018-10-30 19:01:14 +07:00
|
|
|
retval = add_uevent_var(env, "SEQNUM=%llu", ++uevent_seqnum);
|
2012-03-07 17:49:56 +07:00
|
|
|
if (retval) {
|
|
|
|
mutex_unlock(&uevent_sock_mutex);
|
2007-08-14 20:15:12 +07:00
|
|
|
goto exit;
|
2012-03-07 17:49:56 +07:00
|
|
|
}
|
2017-09-20 06:27:03 +07:00
|
|
|
retval = kobject_uevent_net_broadcast(kobj, env, action_string,
|
|
|
|
devpath);
|
2012-03-07 17:49:56 +07:00
|
|
|
mutex_unlock(&uevent_sock_mutex);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2014-04-11 04:09:31 +07:00
|
|
|
#ifdef CONFIG_UEVENT_HELPER
|
2005-11-11 20:43:07 +07:00
|
|
|
/* call uevent_helper, usually only enabled during early boot */
|
2010-05-05 07:36:48 +07:00
|
|
|
if (uevent_helper[0] && !kobj_usermode_filter(kobj)) {
|
kobject: don't block for each kobject_uevent
Currently kobject_uevent has somewhat unpredictable semantics. The
point is, since it may call a usermode helper and wait for it to execute
(UMH_WAIT_EXEC), it is impossible to say for sure what lock dependencies
it will introduce for the caller - strictly speaking it depends on what
fs the binary is located on and the set of locks fork may take. There
are quite a few kobject_uevent users that do not take this into
account and call it with various mutexes taken, e.g. rtnl_mutex,
net_mutex, which might potentially lead to a deadlock.
Since there is actually no reason to wait for the usermode helper to
execute there, let's make kobject_uevent start the helper asynchronously
with the aid of the UMH_NO_WAIT flag.
Personally, I'm interested in this, because I really want kobject_uevent
to be called under the slab_mutex in the slub implementation as it used
to be some time ago, because it greatly simplifies synchronization and
automatically fixes a kmemcg-related race. However, there was a
deadlock detected on an attempt to call kobject_uevent under the
slab_mutex (see https://lkml.org/lkml/2012/1/14/45), which was reported
to be fixed by releasing the slab_mutex for kobject_uevent.
Unfortunately, there was no information about who exactly blocked on the
slab_mutex causing the usermode helper to stall, nor have I managed
to find this out or reproduce the issue.
BTW, this is not the first attempt to make kobject_uevent use
UMH_NO_WAIT. The previous one was made by commit f520360d93cd ("kobject:
don't block for each kobject_uevent"), but it was wrong (it passed
arguments allocated on stack to async thread) so it was reverted in
05f54c13cd0c ("Revert "kobject: don't block for each kobject_uevent".").
That attempt was aimed at speeding up the boot process, though.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg KH <greg@kroah.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 04:48:21 +07:00
|
|
|
struct subprocess_info *info;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-08-14 20:15:12 +07:00
|
|
|
retval = add_uevent_var(env, "HOME=/");
|
|
|
|
if (retval)
|
|
|
|
goto exit;
|
2008-01-25 12:59:04 +07:00
|
|
|
retval = add_uevent_var(env,
|
|
|
|
"PATH=/sbin:/bin:/usr/sbin:/usr/bin");
|
2007-08-14 20:15:12 +07:00
|
|
|
if (retval)
|
|
|
|
goto exit;
|
2014-04-04 04:48:21 +07:00
|
|
|
retval = init_uevent_argv(env, subsystem);
|
|
|
|
if (retval)
|
|
|
|
goto exit;
|
2007-08-14 20:15:12 +07:00
|
|
|
|
2014-04-04 04:48:21 +07:00
|
|
|
retval = -ENOMEM;
|
|
|
|
info = call_usermodehelper_setup(env->argv[0], env->argv,
|
|
|
|
env->envp, GFP_KERNEL,
|
|
|
|
NULL, cleanup_uevent_env, env);
|
|
|
|
if (info) {
|
|
|
|
retval = call_usermodehelper_exec(info, UMH_NO_WAIT);
|
|
|
|
env = NULL; /* freed by cleanup_uevent_env */
|
|
|
|
}
|
2005-11-11 20:43:07 +07:00
|
|
|
}
|
2014-04-11 04:09:31 +07:00
|
|
|
#endif
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
exit:
|
2005-11-11 20:43:07 +07:00
|
|
|
kfree(devpath);
|
2007-08-14 20:15:12 +07:00
|
|
|
kfree(env);
|
2006-12-20 04:01:27 +07:00
|
|
|
return retval;
|
2005-04-17 05:20:36 +07:00
|
|
|
}
|
2006-11-20 23:07:51 +07:00
|
|
|
EXPORT_SYMBOL_GPL(kobject_uevent_env);
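The function above packs its keys (ACTION, DEVPATH, SUBSYSTEM, any caller-supplied keys, then SEQNUM) into one buffer as consecutive NUL-terminated "KEY=value" strings. The following is a minimal userspace sketch of how a consumer might look up one key in such a buffer. The name `demo_env_lookup` and its signature are hypothetical and are not part of the kernel or of any udev library.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: find KEY in a buffer of consecutive NUL-terminated
 * "KEY=value" strings and return a pointer to the value, or NULL if the
 * key is absent or the buffer ends in an unterminated entry.
 */
static const char *demo_env_lookup(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t pos = 0;

	while (pos < len) {
		const char *entry = buf + pos;
		const char *end = memchr(entry, '\0', len - pos);
		size_t elen;

		if (!end)
			break;	/* trailing entry has no terminator: stop */

		elen = (size_t)(end - entry);
		/* match "KEY=" exactly, so "ACTIO" does not match "ACTION=..." */
		if (elen > klen && entry[klen] == '=' &&
		    strncmp(entry, key, klen) == 0)
			return entry + klen + 1;

		pos += elen + 1;	/* skip the string and its NUL */
	}
	return NULL;
}
```

The length is passed explicitly because the buffer contains embedded NUL bytes, so `strlen()` on the whole buffer would stop after the first entry.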
|
|
|
|
|
|
|
|
/**
|
2010-08-13 17:58:10 +07:00
|
|
|
* kobject_uevent - notify userspace by sending a uevent
|
2006-11-20 23:07:51 +07:00
|
|
|
*
|
|
|
|
* @kobj: struct kobject that the action is happening to
|
2016-10-02 02:46:28 +07:00
|
|
|
* @action: action that is happening
|
2006-12-20 04:01:27 +07:00
|
|
|
*
|
|
|
|
* Returns 0 if kobject_uevent() is completed with success or the
|
|
|
|
* corresponding error when it fails.
|
2006-11-20 23:07:51 +07:00
|
|
|
*/
|
2006-12-20 04:01:27 +07:00
|
|
|
int kobject_uevent(struct kobject *kobj, enum kobject_action action)
|
2006-11-20 23:07:51 +07:00
|
|
|
{
|
2006-12-20 04:01:27 +07:00
|
|
|
return kobject_uevent_env(kobj, action, NULL);
|
2006-11-20 23:07:51 +07:00
|
|
|
}
|
2005-11-16 15:00:00 +07:00
|
|
|
EXPORT_SYMBOL_GPL(kobject_uevent);
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
/**
|
2007-08-14 20:15:12 +07:00
|
|
|
* add_uevent_var - add key value string to the environment buffer
|
|
|
|
* @env: environment buffer structure
|
|
|
|
* @format: printf format for the key=value pair
|
2005-04-17 05:20:36 +07:00
|
|
|
*
|
|
|
|
* Returns 0 if environment variable was added successfully or -ENOMEM
|
|
|
|
* if no space was available.
|
|
|
|
*/
|
2007-08-14 20:15:12 +07:00
|
|
|
int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
|
2005-04-17 05:20:36 +07:00
|
|
|
{
|
|
|
|
va_list args;
|
2007-08-14 20:15:12 +07:00
|
|
|
int len;
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-08-14 20:15:12 +07:00
|
|
|
if (env->envp_idx >= ARRAY_SIZE(env->envp)) {
|
2008-07-26 09:45:39 +07:00
|
|
|
WARN(1, KERN_ERR "add_uevent_var: too many keys\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
return -ENOMEM;
|
2007-08-14 20:15:12 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
|
|
|
va_start(args, format);
|
2007-08-14 20:15:12 +07:00
|
|
|
len = vsnprintf(&env->buf[env->buflen],
|
|
|
|
sizeof(env->buf) - env->buflen,
|
|
|
|
format, args);
|
2005-04-17 05:20:36 +07:00
|
|
|
va_end(args);
|
|
|
|
|
2007-08-14 20:15:12 +07:00
|
|
|
if (len >= (sizeof(env->buf) - env->buflen)) {
|
2008-07-26 09:45:39 +07:00
|
|
|
WARN(1, KERN_ERR "add_uevent_var: buffer size too small\n");
|
2005-04-17 05:20:36 +07:00
|
|
|
return -ENOMEM;
|
2007-08-14 20:15:12 +07:00
|
|
|
}
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2007-08-14 20:15:12 +07:00
|
|
|
env->envp[env->envp_idx++] = &env->buf[env->buflen];
|
|
|
|
env->buflen += len + 1;
|
2005-04-17 05:20:36 +07:00
|
|
|
return 0;
|
|
|
|
}
|
2005-11-16 15:00:00 +07:00
|
|
|
EXPORT_SYMBOL_GPL(add_uevent_var);
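The packing logic of add_uevent_var() can be tried outside the kernel. Below is a rough userspace sketch of the same idea: format into the unused tail of a fixed buffer, reject the string if it would not fit including its terminator, then record a pointer to it and advance the fill level by `len + 1`. The `demo_` names and the two size constants are assumptions chosen for illustration; the kernel's real limits are defined elsewhere, and the kernel version returns -ENOMEM and issues a WARN rather than returning -1.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical limits for this sketch only. */
#define DEMO_NUM_ENVP    4
#define DEMO_BUFFER_SIZE 64

struct demo_env {
	char *envp[DEMO_NUM_ENVP];	/* pointers into buf, one per variable */
	int envp_idx;			/* number of pointers in use */
	char buf[DEMO_BUFFER_SIZE];	/* packed NUL-terminated strings */
	int buflen;			/* bytes of buf in use */
};

/* Append one formatted "KEY=value" string; 0 on success, -1 if it does not fit. */
static int demo_add_var(struct demo_env *env, const char *format, ...)
{
	va_list args;
	size_t room;
	int len;

	if (env->envp_idx >= DEMO_NUM_ENVP)
		return -1;	/* no free pointer slot */

	room = sizeof(env->buf) - (size_t)env->buflen;

	va_start(args, format);
	len = vsnprintf(&env->buf[env->buflen], room, format, args);
	va_end(args);

	/* vsnprintf reports the untruncated length; it must leave room for the NUL */
	if (len < 0 || (size_t)len >= room)
		return -1;

	env->envp[env->envp_idx++] = &env->buf[env->buflen];
	env->buflen += len + 1;	/* keep the terminator as the separator */
	return 0;
}
```

On failure the fill level is left unchanged, so a later, shorter variable can still be appended, which mirrors how the function above leaves `env->buflen` alone on its error paths.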
|
2005-04-17 05:20:36 +07:00
|
|
|
|
2006-04-25 20:37:26 +07:00
|
|
|
#if defined(CONFIG_NET)
|
2018-03-19 19:17:31 +07:00
|
|
|
static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
|
|
|
|
struct netlink_ext_ack *extack)
|
|
|
|
{
|
|
|
|
/* u64 to chars: 2^64 - 1 is 20 digits, 21 leaves one spare */
|
|
|
|
char buf[sizeof("SEQNUM=") + 21];
|
|
|
|
struct sk_buff *skbc;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* bump and prepare sequence number */
|
|
|
|
ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", ++uevent_seqnum);
|
|
|
|
if (ret < 0 || (size_t)ret >= sizeof(buf))
|
|
|
|
return -ENOMEM;
|
|
|
|
ret++;
|
|
|
|
|
|
|
|
/* verify message does not overflow */
|
|
|
|
if ((skb->len + ret) > UEVENT_BUFFER_SIZE) {
|
|
|
|
NL_SET_ERR_MSG(extack, "uevent message too big");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* copy skb and extend to accommodate sequence number */
|
|
|
|
skbc = skb_copy_expand(skb, 0, ret, GFP_KERNEL);
|
|
|
|
if (!skbc)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/* append sequence number */
|
|
|
|
skb_put_data(skbc, buf, ret);
|
|
|
|
|
|
|
|
/* remove msg header */
|
|
|
|
skb_pull(skbc, NLMSG_HDRLEN);
|
|
|
|
|
|
|
|
/* set portid 0 to inform userspace message comes from kernel */
|
|
|
|
NETLINK_CB(skbc).portid = 0;
|
|
|
|
NETLINK_CB(skbc).dst_group = 1;
|
|
|
|
|
|
|
|
ret = netlink_broadcast(usk, skbc, 0, 1, GFP_KERNEL);
|
|
|
|
/* ENOBUFS should be handled in userspace; ESRCH means no listeners */
|
|
|
|
if (ret == -ENOBUFS || ret == -ESRCH)
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
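The sizing of the sequence-number buffer can be checked with a small userspace sketch. The largest u64, 2^64 - 1, prints as 20 decimal digits, and `sizeof("SEQNUM=")` already counts the terminating NUL, so a buffer declared as above holds 29 bytes while the longest possible string needs 28. The helper name `demo_format_seqnum` is hypothetical. Like the function above, it returns the length including the terminator, but it signals failure with -1 rather than -ENOMEM.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "SEQNUM=<seq>" into buf and return the number
 * of bytes used including the terminating NUL, or -1 if it does not fit.
 */
static int demo_format_seqnum(char *buf, size_t size, unsigned long long seq)
{
	int ret = snprintf(buf, size, "SEQNUM=%llu", seq);

	/* snprintf returns the untruncated length, so ret >= size means truncation */
	if (ret < 0 || (size_t)ret >= size)
		return -1;

	return ret + 1;	/* count the NUL, as the kernel code does with ret++ */
}
```

Counting the NUL in the returned length matters because that byte is appended to the message together with the string, so each variable stays NUL-terminated for the receiver.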
|
|
|
|
|
|
|
|
static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
|
|
|
|
struct netlink_ext_ack *extack)
|
|
|
|
{
|
|
|
|
struct net *net;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!nlmsg_data(nlh))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we are allowed to send messages to the target
|
|
|
|
* network namespace. The caller must have CAP_SYS_ADMIN in the
|
|
|
|
* owning user namespace of the target network namespace.
|
|
|
|
*/
|
|
|
|
net = sock_net(NETLINK_CB(skb).sk);
|
|
|
|
if (!netlink_ns_capable(skb, net->user_ns, CAP_SYS_ADMIN)) {
|
|
|
|
NL_SET_ERR_MSG(extack, "missing CAP_SYS_ADMIN capability");
|
|
|
|
return -EPERM;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_lock(&uevent_sock_mutex);
|
|
|
|
ret = uevent_net_broadcast(net->uevent_sock->sk, skb, extack);
|
|
|
|
mutex_unlock(&uevent_sock_mutex);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void uevent_net_rcv(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
netlink_rcv_skb(skb, &uevent_net_rcv_skb);
|
|
|
|
}
|
|
|
|
|
2010-05-05 07:36:44 +07:00
|
|
|
static int uevent_net_init(struct net *net)
|
2005-11-11 20:43:07 +07:00
|
|
|
{
|
2010-05-05 07:36:44 +07:00
|
|
|
struct uevent_sock *ue_sk;
|
2012-06-29 13:15:21 +07:00
|
|
|
struct netlink_kernel_cfg cfg = {
|
|
|
|
.groups = 1,
|
2018-03-19 19:17:31 +07:00
|
|
|
.input = uevent_net_rcv,
|
|
|
|
.flags = NL_CFG_F_NONROOT_RECV
|
2012-06-29 13:15:21 +07:00
|
|
|
};
|
2010-05-05 07:36:44 +07:00
|
|
|
|
|
|
|
ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
|
|
|
|
if (!ue_sk)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2012-09-08 09:53:54 +07:00
|
|
|
ue_sk->sk = netlink_kernel_create(net, NETLINK_KOBJECT_UEVENT, &cfg);
|
2010-05-05 07:36:44 +07:00
|
|
|
if (!ue_sk->sk) {
|
2019-01-09 16:17:00 +07:00
|
|
|
pr_err("kobject_uevent: unable to create netlink socket!\n");
|
2010-05-25 16:51:10 +07:00
|
|
|
kfree(ue_sk);
|
2005-11-11 20:43:07 +07:00
|
|
|
return -ENODEV;
|
|
|
|
}
|
2018-03-19 19:17:30 +07:00
|
|
|
|
|
|
|
net->uevent_sock = ue_sk;
|
|
|
|
|
netns: restrict uevents
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket.
Currently, only network devices carry namespace tags (i.e. network
namespace tags). Hence, uevents for network devices only show up in the
network namespace such devices are created in or moved to.
However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.
This patch simplifies and fixes a couple of things:
- Split codepath for sending uevents by kobject namespace tags:
1. Untagged kobjects - uevent_net_broadcast_untagged():
Untagged kobjects will be broadcast into all uevent sockets recorded
in uevent_sock_list, i.e. into all network namespaces owned by the
initial user namespace.
2. Tagged kobjects - uevent_net_broadcast_tagged():
Tagged kobjects will only be broadcast into the network namespace they
were tagged with.
Handling of tagged kobjects in 2. does not cause any semantic changes.
This is just splitting out the filtering logic that was handled by
kobj_bcast_filter() before.
Handling of untagged kobjects in 1. will cause a semantic change. The
reasons why this is needed and ok have been discussed in [1]. Here is a
short summary:
- Userspace ignores uevents from network namespaces that are not owned by
the initial user namespace:
Uevents are filtered by userspace in a user namespace because the
received uid != 0. Instead the uid associated with the event will be
65534 == "nobody" because the global root uid is not mapped.
This means we can safely and without introducing regressions modify the
kernel to not send uevents into all network namespaces whose owning
user namespace is not the initial user namespace because we know that
userspace will ignore the message because of the uid anyway.
I have verified a) that this is true for every udev implementation out
there and b) that this behavior has been present in all udev
implementations from the very beginning.
- Thundering herd:
Broadcasting uevents into all network namespaces introduces significant
overhead.
All processes that listen to uevents running in non-initial user
namespaces will end up responding to uevents that will be meaningless
to them. Mainly, because non-initial user namespaces cannot easily
manage devices unless they have a privileged host-process helping them
out. This means that there will be a thundering herd of activity when
there shouldn't be any.
- Removing needless overhead/Increasing performance:
Currently, the uevent socket for each network namespace is added to the
global variable uevent_sock_list. The list itself needs to be protected
by a mutex. So every time a uevent is generated the mutex is taken on
the list. The mutex is held *from the creation of the uevent (memory
allocation, string creation etc.) until all uevent sockets have been
handled*. This is aggravated by the fact that for each uevent socket
that has listeners the mc_list must be walked as well which means we're
talking O(n^2) here. Given that a standard Linux workload usually has
quite a lot of network namespaces and - in the face of containers - a
lot of user namespaces this quickly becomes a performance problem (see
"Thundering herd" above). By just recording uevent sockets of network
namespaces that are owned by the initial user namespace we
significantly increase performance in this codepath.
- Injecting uevents:
There's a valid argument that containers might be interested in
receiving device events especially if they are delegated to them by a
privileged userspace process. One prime example is SR-IOV enabled
devices that are explicitly designed to be handed off to other users
such as VMs or containers.
This use-case can now be correctly handled since
commit 692ec06d7c92 ("netns: send uevent messages"). This commit
introduced the ability to send uevents from userspace. As such we can
let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
namespace of the network namespace of the netlink socket) userspace
process make a decision what uevents should be sent. This removes the
need to blindly broadcast uevents into all user namespaces and provides
a performant and safe solution to this problem.
- Filtering logic:
This patch filters by *owning user namespace of the network namespace a
given task resides in* and not by user namespace of the task per se.
This means if the user namespace of a given task is unshared but the
network namespace is kept and is owned by the initial user namespace a
listener that is opening the uevent socket in that network namespace
can still listen to uevents.
- Fix permission for tagged kobjects:
Network devices that are created or moved into a network namespace that
is owned by a non-initial user namespace are currently sent with
INVALID_{G,U}ID in their credentials. This means that all current udev
implementations in userspace will ignore the uevent they receive for
them. This has led to weird bugs whereby new devices showing up in such
network namespaces were not recognized and did not get IPs assigned etc.
This patch adjusts the permission to the appropriate {g,u}id in the
respective user namespace. This way udevd is able to correctly handle
such devices.
- Simplify filtering logic:
do_one_broadcast() already ensures that only listeners in mc_list receive
uevents that have the same network namespace as the uevent socket itself.
So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
patch therefore removes kobj_bcast_filter() and replaces
netlink_broadcast_filtered() with the simpler netlink_broadcast()
everywhere.
[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29 17:44:12 +07:00
|
|
|
/* Restrict uevents to initial user namespace. */
|
|
|
|
if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
|
|
|
|
mutex_lock(&uevent_sock_mutex);
|
|
|
|
list_add_tail(&ue_sk->list, &uevent_sock_list);
|
|
|
|
mutex_unlock(&uevent_sock_mutex);
|
|
|
|
}
|
|
|
|
|
2005-11-11 20:43:07 +07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-05-05 07:36:44 +07:00
|
|
|
static void uevent_net_exit(struct net *net)
|
|
|
|
{
|
2018-03-19 19:17:30 +07:00
|
|
|
struct uevent_sock *ue_sk = net->uevent_sock;
|
2010-05-05 07:36:44 +07:00
|
|
|
|
2018-04-29 17:44:12 +07:00
|
|
|
if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
|
|
|
|
mutex_lock(&uevent_sock_mutex);
|
|
|
|
list_del(&ue_sk->list);
|
|
|
|
mutex_unlock(&uevent_sock_mutex);
|
|
|
|
}
|
2010-05-05 07:36:44 +07:00
|
|
|
|
|
|
|
netlink_kernel_release(ue_sk->sk);
|
|
|
|
kfree(ue_sk);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct pernet_operations uevent_net_ops = {
|
|
|
|
.init = uevent_net_init,
|
|
|
|
.exit = uevent_net_exit,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init kobject_uevent_init(void)
|
|
|
|
{
|
|
|
|
return register_pernet_subsys(&uevent_net_ops);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2005-11-11 20:43:07 +07:00
|
|
|
postcore_initcall(kobject_uevent_init);
|
2006-04-25 20:37:26 +07:00
|
|
|
#endif
|