linux_dsm_epyc7002/lib
Christian Brauner a3498436b3 netns: restrict uevents
commit 07e98962fa ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket.
Currently, only network devices carry namespace tags (i.e. network
namespace tags). Hence, uevents for network devices only show up in the
network namespace such devices are created in or moved to.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch simplifies and fixes couple of things:
- Split codepath for sending uevents by kobject namespace tags:
  1. Untagged kobjects - uevent_net_broadcast_untagged():
     Untagged kobjects will be broadcast into all uevent sockets recorded
     in uevent_sock_list, i.e. into all network namespacs owned by the
     intial user namespace.
  2. Tagged kobjects - uevent_net_broadcast_tagged():
     Tagged kobjects will only be broadcast into the network namespace they
     were tagged with.
  Handling of tagged kobjects in 2. does not cause any semantic changes.
  This is just splitting out the filtering logic that was handled by
  kobj_bcast_filter() before.
  Handling of untagged kobjects in 1. will cause a semantic change. The
  reasons why this is needed and ok have been discussed in [1]. Here is a
  short summary:
  - Userspace ignores uevents from network namespaces that are not owned by
    the intial user namespace:
    Uevents are filtered by userspace in a user namespace because the
    received uid != 0. Instead the uid associated with the event will be
    65534 == "nobody" because the global root uid is not mapped.
    This means we can safely and without introducing regressions modify the
    kernel to not send uevents into all network namespaces whose owning
    user namespace is not the initial user namespace because we know that
    userspace will ignore the message because of the uid anyway.
    I have a) verified that is is true for every udev implementation out
    there b) that this behavior has been present in all udev
    implementations from the very beginning.
  - Thundering herd:
    Broadcasting uevents into all network namespaces introduces significant
    overhead.
    All processes that listen to uevents running in non-initial user
    namespaces will end up responding to uevents that will be meaningless
    to them. Mainly, because non-initial user namespaces cannot easily
    manage devices unless they have a privileged host-process helping them
    out. This means that there will be a thundering herd of activity when
    there shouldn't be any.
  - Removing needless overhead/Increasing performance:
    Currently, the uevent socket for each network namespace is added to the
    global variable uevent_sock_list. The list itself needs to be protected
    by a mutex. So everytime a uevent is generated the mutex is taken on
    the list. The mutex is held *from the creation of the uevent (memory
    allocation, string creation etc. until all uevent sockets have been
    handled*. This is aggravated by the fact that for each uevent socket
    that has listeners the mc_list must be walked as well which means we're
    talking O(n^2) here. Given that a standard Linux workload usually has
    quite a lot of network namespaces and - in the face of containers - a
    lot of user namespaces this quickly becomes a performance problem (see
    "Thundering herd" above). By just recording uevent sockets of network
    namespaces that are owned by the initial user namespace we
    significantly increase performance in this codepath.
  - Injecting uevents:
    There's a valid argument that containers might be interested in
    receiving device events especially if they are delegated to them by a
    privileged userspace process. One prime example are SR-IOV enabled
    devices that are explicitly designed to be handed of to other users
    such as VMs or containers.
    This use-case can now be correctly handled since
    commit 692ec06d7c ("netns: send uevent messages"). This commit
    introduced the ability to send uevents from userspace. As such we can
    let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
    namespace of the network namespace of the netlink socket) userspace
    process make a decision what uevents should be sent. This removes the
    need to blindly broadcast uevents into all user namespaces and provides
    a performant and safe solution to this problem.
  - Filtering logic:
    This patch filters by *owning user namespace of the network namespace a
    given task resides in* and not by user namespace of the task per se.
    This means if the user namespace of a given task is unshared but the
    network namespace is kept and is owned by the initial user namespace a
    listener that is opening the uevent socket in that network namespace
    can still listen to uevents.
- Fix permission for tagged kobjects:
  Network devices that are created or moved into a network namespace that
  is owned by a non-initial user namespace currently are send with
  INVALID_{G,U}ID in their credentials. This means that all current udev
  implementations in userspace will ignore the uevent they receive for
  them. This has lead to weird bugs whereby new devices showing up in such
  network namespaces were not recognized and did not get IPs assigned etc.
  This patch adjusts the permission to the appropriate {g,u}id in the
  respective user namespace. This way udevd is able to correctly handle
  such devices.
- Simplify filtering logic:
  do_one_broadcast() already ensures that only listeners in mc_list receive
  uevents that have the same network namespace as the uevent socket itself.
  So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
  patch therefore removes kobj_bcast_filter() and replaces
  netlink_broadcast_filtered() with the simpler netlink_broadcast()
  everywhere.

[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 10:22:41 -04:00
..
842
fonts
lz4
lzo
mpi lib/mpi: Fix umul_ppmm() for MIPS64r6 2017-12-22 19:39:09 +11:00
raid6 powerpc updates for 4.17 2018-04-07 12:08:19 -07:00
reed_solomon
xz
zlib_deflate
zlib_inflate
zstd lib: zstd: clean up Makefile for simpler composite object handling 2018-03-26 02:01:27 +09:00
.gitignore
argv_split.c
ashldi3.c
ashrdi3.c
asn1_decoder.c
assoc_array.c
atomic64_test.c
atomic64.c
audit.c
bcd.c
bch.c
bitmap.c lib: fix stall in __bitmap_parselist() 2018-04-05 21:36:21 -07:00
bitrev.c
bsearch.c
btree.c btree: avoid variable-length allocations 2018-03-14 16:55:29 -07:00
bucket_locks.c
bug.c lib/bug.c: exclude non-BUG/WARN exceptions from report_bug() 2018-03-09 16:40:01 -08:00
build_OID_registry
bust_spinlocks.c
chacha20.c crypto: chacha20 - use rol32() macro from bitops.h 2018-01-12 23:03:01 +11:00
check_signature.c
checksum.c
clz_ctz.c
clz_tab.c
cmdline.c
cmpdi2.c
compat_audit.c
cordic.c
cpu_rmap.c
cpumask.c lib: optimize cpumask_next_and() 2018-02-06 18:32:44 -08:00
crc4.c
crc7.c
crc8.c
crc16.c
crc32.c
crc32defs.h
crc32test.c
crc-ccitt.c lib/crc-ccitt: Add CCITT-FALSE CRC16 variant 2018-01-08 10:08:33 +00:00
crc-itu-t.c
crc-t10dif.c
ctype.c
debug_info.c
debug_locks.c
debugobjects.c debugobjects: Avoid another unused variable warning 2018-03-14 20:20:01 +01:00
dec_and_lock.c
decompress_bunzip2.c
decompress_inflate.c
decompress_unlz4.c
decompress_unlzma.c
decompress_unlzo.c
decompress_unxz.c
decompress.c
devres.c devres: combine function devm_ioremap* 2018-03-15 18:08:55 +01:00
digsig.c
div64.c
dma-debug.c dma-debug: fix memory leak in debug_dma_alloc_coherent 2018-02-22 15:02:33 -08:00
dma-direct.c dma-mapping: Don't clear GFP_ZERO in dma_alloc_attrs 2018-03-28 17:34:23 +02:00
dma-virt.c
dump_stack.c printk: move dump stack related code to lib/dump_stack.c 2018-03-15 13:25:36 +01:00
dynamic_debug.c
dynamic_queue_limits.c
earlycpio.c
error-inject.c error-injection: Add injectable error types 2018-01-12 17:33:38 -08:00
errseq.c errseq: Add to documentation tree 2018-01-01 12:40:27 -07:00
extable.c
fault-inject.c
fdt_empty_tree.c
fdt_ro.c
fdt_rw.c
fdt_strerror.c
fdt_sw.c
fdt_wip.c
fdt.c
find_bit_benchmark.c lib: optimize cpumask_next_and() 2018-02-06 18:32:44 -08:00
find_bit.c lib: optimize cpumask_next_and() 2018-02-06 18:32:44 -08:00
flex_array.c
flex_proportions.c
gcd.c
gen_crc32table.c
genalloc.c
glob.c
globtest.c
hexdump.c
hweight.c
idr.c idr: Fix handling of IDs above INT_MAX 2018-02-26 14:39:30 -05:00
inflate.c
int_sqrt.c lib: Add strongly typed 64bit int_sqrt 2018-02-04 10:17:21 +00:00
interval_tree_test.c
interval_tree.c
iomap_copy.c
iomap.c
iommu-common.c
iommu-helper.c
ioremap.c mm/vmalloc: add interfaces to free unmapped page table 2018-03-22 17:07:01 -07:00
iov_iter.c
irq_poll.c
irq_regs.c
is_single_threaded.c
jedec_ddr_data.c
kasprintf.c
Kconfig lib: Add generic PIO mapping method 2018-03-21 17:18:34 -05:00
Kconfig.debug lib/Kconfig.debug: Debug Lockups and Hangs: keep SOFTLOCKUP options together 2018-04-11 10:28:35 -07:00
Kconfig.kasan kasan: rework Kconfig settings 2018-02-06 18:32:47 -08:00
Kconfig.kgdb
Kconfig.ubsan lib: add testing module for UBSAN 2018-04-11 10:28:35 -07:00
kfifo.c kfifo: fix inaccurate comment 2018-03-27 11:15:42 +02:00
klist.c
kobject_uevent.c netns: restrict uevents 2018-05-01 10:22:41 -04:00
kobject.c lib/kobject: Join string literals back 2018-03-15 14:38:55 +01:00
kstrtox.c
kstrtox.h
lcm.c
libcrc32c.c libcrc32c: Add crc32c_impl function 2018-03-26 15:09:38 +02:00
list_debug.c lib/list_debug.c: print unmangled addresses 2018-04-11 10:28:35 -07:00
list_sort.c
llist.c
locking-selftest-hardirq.h
locking-selftest-mutex.h
locking-selftest-rlock-hardirq.h
locking-selftest-rlock-softirq.h
locking-selftest-rlock.h
locking-selftest-rsem.h
locking-selftest-rtmutex.h
locking-selftest-softirq.h
locking-selftest-spin-hardirq.h
locking-selftest-spin-softirq.h
locking-selftest-spin.h
locking-selftest-wlock-hardirq.h
locking-selftest-wlock-softirq.h
locking-selftest-wlock.h
locking-selftest-wsem.h
locking-selftest.c
lockref.c lockref: Add lockref_put_not_zero 2018-04-12 09:41:19 -07:00
logic_pio.c lib: Add generic PIO mapping method 2018-03-21 17:18:34 -05:00
lru_cache.c
lshrdi3.c
Makefile lib: add testing module for UBSAN 2018-04-11 10:28:35 -07:00
memory-notifier-error-inject.c
memweight.c
muldi3.c
net_utils.c
netdev-notifier-error-inject.c
nlattr.c
nmi_backtrace.c
nodemask.c
notifier-error-inject.c
notifier-error-inject.h
of-reconfig-notifier-error-inject.c
oid_registry.c
once.c
parman.c
parser.c
pci_iomap.c PCI: Add SPDX GPL-2.0 when no license was specified 2018-01-26 11:45:16 -06:00
percpu_counter.c
percpu_ida.c
percpu_test.c
percpu-refcount.c percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods 2018-03-19 10:09:44 -07:00
plist.c
pm-notifier-error-inject.c
prime_numbers.c
radix-tree.c radix tree: use GFP_ZONEMASK bits of gfp_t for flags 2018-04-11 10:28:39 -07:00
random32.c
ratelimit.c
rational.c
rbtree_test.c
rbtree.c
reciprocal_div.c
refcount.c
rhashtable.c rhashtable: improve rhashtable_walk stability when stop/start used. 2018-04-24 13:21:46 -04:00
sbitmap.c sbitmap: use test_and_set_bit_lock()/clear_bit_unlock() 2018-02-28 12:23:35 -07:00
scatterlist.c lib/scatterlist: add sg_init_marker() helper 2018-03-30 22:50:15 +02:00
seq_buf.c
sg_pool.c
sg_split.c
sha1.c
sha256.c kernel/kexec_file.c: move purgatories sha256 to common code 2018-04-13 17:10:28 -07:00
show_mem.c
siphash.c
smp_processor_id.c lib: do not use print_symbol() 2018-01-05 15:24:00 +01:00
sort.c
stackdepot.c lib/stackdepot.c: use a non-instrumented version of memcmp() 2018-02-06 18:32:44 -08:00
stmp_device.c
string_helpers.c
string.c lib/strscpy: Shut up KASAN false-positives in strscpy() 2018-02-01 12:20:21 -08:00
strncpy_from_user.c
strnlen_user.c
swiotlb.c Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-04-15 16:12:35 -07:00
syscall.c
test_bitmap.c lib/test_bitmap.c: do not accidentally use stack VLA 2018-04-11 10:28:35 -07:00
test_bpf.c test_bpf: Fix NULL vs IS_ERR() check in test_skb_segment() 2018-03-29 14:33:29 -04:00
test_debug_virtual.c
test_firmware.c headers: untangle kmemleak.h from mm.h 2018-04-05 21:36:27 -07:00
test_hash.c
test_hexdump.c
test_kasan.c kasan: fix invalid-free test crashing the kernel 2018-04-11 10:28:32 -07:00
test_kmod.c lib/test_kmod.c: fix limit check on number of test devices created 2018-03-09 16:40:02 -08:00
test_list_sort.c
test_module.c
test_parman.c
test_printf.c
test_rhashtable.c test_rhashtable: add test case for rhltable with duplicate objects 2018-03-07 10:44:03 -05:00
test_siphash.c
test_sort.c lib/test_sort.c: add module unload support 2018-02-06 18:32:45 -08:00
test_static_key_base.c
test_static_keys.c
test_string.c
test_sysctl.c
test_ubsan.c lib/test_ubsan.c: make test_ubsan_misaligned_access() static 2018-04-11 10:28:35 -07:00
test_user_copy.c treewide: simplify Kconfig dependencies for removed archs 2018-03-26 15:55:57 +02:00
test_uuid.c
test-kstrtox.c
test-string_helpers.c
textsearch.c textsearch: fix kernel-doc warnings and add kernel-api section 2018-04-16 18:53:13 -04:00
timerqueue.c timerqueue: Document return values of timerqueue_add/del() 2017-12-29 23:13:10 +01:00
ts_bm.c
ts_fsm.c
ts_kmp.c
ubsan.c lib/ubsan: remove returns-nonnull-attribute checks 2018-02-06 18:32:46 -08:00
ubsan.h lib/ubsan: remove returns-nonnull-attribute checks 2018-02-06 18:32:46 -08:00
ucmpdi2.c
ucs2_string.c
usercopy.c
uuid.c
vsprintf.c proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps 2018-04-11 10:28:33 -07:00
win_minmax.c
xxhash.c