linux_dsm_epyc7002/Documentation
Andrea Arcangeli dd0db88d80 userfaultfd: non-cooperative: rollback userfaultfd_exit
Patch series "userfaultfd non-cooperative further update for 4.11 merge
window".

Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
more testing.  I've been doing testing before and this was also tested
by kbuild bot and exercised by the selftest, but this bug never
reproduced before.

I dropped userfaultfd_exit as result.  I dropped it because of
implementation difficulty in receiving signals in __mmput and because I
think -ENOSPC as result from the background UFFDIO_COPY should be enough
already.

Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
wasn't exercised by the selftest and when I tried to exercise it, after
moving it to a more correct place in __mmput where it would make more
sense and where the vma list is stable, it resulted in the
event_wait_completion in D state.  So then I added the second patch to
be sure even if we call userfaultfd_event_wait_completion too late
during task exit(), we won't risk to generate tasks in D state.  The
same check exists in handle_userfault() for the same reason, except it
makes a difference there, while here is just a robustness check and it's
run under WARN_ON_ONCE.

While looking at the userfaultfd_event_wait_completion() function I
looked back at its callers too while at it and I think it's not ok to
stop executing dup_fctx on the fcs list because we relay on
userfaultfd_event_wait_completion to execute
userfaultfd_ctx_put(fctx->orig) which is paired against
userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
list_add(fcs).  This change only takes care of fctx->orig but this area
also needs further review looking for similar problems in fctx->new.

The only patch that is urgent is the first because it's an use after
free during a SMP race condition that affects all processes if
CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
impossible without SLUB poisoning enabled.

This patch (of 3):

I once reproduced this oops with the userfaultfd selftest, it's not
easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
      do_exit+0x297/0xd10
      SyS_exit+0x17/0x20
      tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
     RSP <ffff8801f833fe80>
    ---[ end trace 9fecd6dcb442846a ]---

In the debugger I located the "mm" pointer in the stack and walking
mm->mmap->vm_next through the end shows the vma->vm_next list is fully
consistent and it is null terminated list as expected.  So this has to
be an SMP race condition where userfaultfd_exit was running while the
vma list was being modified by another CPU.

When userfaultfd_exit() run one of the ->vm_next pointers pointed to
SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

The reason is that it's not running in __mmput but while there are still
other threads running and it's not holding the mmap_sem (it can't as it
has to wait the even to be received by the manager).  So this is an use
after free that was happening for all processes.

One more implementation problem aside from the race condition:
userfaultfd_exit has really to check a flag in mm->flags before walking
the vma or it's going to slowdown the exit() path for regular tasks.

One more implementation problem: at that point signals can't be
delivered so it would also create a task in D state if the manager
doesn't read the event.

The major design issue: it overall looks superfluous as the manager can
check for -ENOSPC in the background transfer:

	if (mmget_not_zero(ctx->mm)) {
[..]
	} else {
		return -ENOSPC;
	}

It's safer to roll it back and re-introduce it later if at all.

[rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
  Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
..
ABI RTC for 4.11 2017-02-27 19:59:21 -08:00
accounting tools: move accounting tool from Documentation 2016-09-23 13:07:15 -06:00
acpi scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
admin-guide There was some breakage with the changes for jump labels in the 4.11 merge 2017-03-07 09:37:28 -08:00
aoe
arm arm: sunxi: add support for V3s SoC 2017-01-20 21:31:34 +01:00
arm64 arm64: Work around Falkor erratum 1003 2017-02-10 11:22:12 +00:00
auxdisplay samples: move auxdisplay example code from Documentation 2016-09-23 11:52:32 -06:00
backlight
blackfin samples: move blackfin gptimers-example from Documentation 2016-10-10 07:12:02 -06:00
block A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
blockdev scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
bus-devices
cdrom cdrom: Make device operations read-only 2017-02-14 08:29:56 -07:00
cgroup-v1 Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2017-02-27 21:41:08 -08:00
cma
connector samples: connector: from Documentation to samples directory 2016-04-28 07:47:35 -06:00
console
core-api Documentation: Update CPU hotplug and move it to core-api 2017-01-13 10:32:32 -07:00
cpu-freq A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
cpuidle
cris
crypto crypto: doc - fix typo 2017-02-15 13:23:49 +08:00
dev-tools scripts/spelling.txt: add "disble(d)" pattern and fix typo instances 2017-03-09 17:01:09 -08:00
device-mapper scripts/spelling.txt: add "explictely" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
devicetree scripts/spelling.txt: add "overide" pattern and fix typo instances 2017-03-09 17:01:09 -08:00
dmaengine dmaengine: Documentation: Fix typo in pxa_dma.txt 2016-11-14 08:14:24 +05:30
doc-guide docs-rst: parse-headers.pl: cleanup the documentation 2016-11-30 17:08:09 -07:00
DocBook A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
driver-api Less anger inducing pull request for 4.11 2017-02-23 18:58:18 -08:00
driver-model irqdesc: Add a resource managed version of irq_alloc_descs() 2017-02-10 14:39:20 +01:00
early-userspace
EDID
extcon extcon: int3496: Add Intel INT3496 ACPI device extcon driver 2017-01-09 10:04:11 +09:00
fault-injection
fb Documentation: fb: fix spelling mistakes 2016-05-10 12:05:27 +03:00
features 2nd round of ARC udpates for 4.10rc1 2016-12-23 10:22:47 -08:00
filesystems statx: Add a system call to make enhanced file info available 2017-03-02 20:51:15 -05:00
firmware_class firmware: revamp firmware documentation 2017-01-11 09:42:59 +01:00
fmc
fpga fpga: Add scatterlist based programming 2017-02-10 15:20:44 +01:00
frv docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
gpio gpio: random documentation update 2017-01-31 15:43:05 +01:00
gpu Less anger inducing pull request for 4.11 2017-02-23 18:58:18 -08:00
hid Documentation: HID: Intel ISH HID document 2016-08-17 11:13:07 +02:00
hwmon A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
i2c i2c: i801: Add support for Intel Gemini Lake 2017-02-09 17:39:16 +01:00
ia64 selftests: move ia64 tests from Documentation/ia64 2016-09-20 09:58:12 -06:00
ide
iio iio: Documentation: Correct the path used to create triggers. 2016-10-01 00:49:58 -06:00
infiniband IB/hfi1: Document new sysfs entries for hfi1 driver 2016-10-02 08:42:19 -04:00
input Documentation: input: fix path to input code definitions 2017-02-12 15:19:00 -07:00
ioctl rpmsg updates for v4.11 2017-02-23 09:41:03 -08:00
isdn docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
kbuild Kconfig: Introduce the "imply" keyword 2016-11-16 09:26:33 +01:00
kdump Documentation: kdump: Add description of enable multi-cpus support 2016-09-20 18:02:54 -06:00
laptops platform/x86: thinkpad_acpi: Add support for X1 Yoga (2016) Tablet Mode 2016-12-13 09:29:06 -08:00
leds leds: class: Add new optional brightness_hw_changed attribute 2017-01-29 19:59:42 +01:00
livepatch A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
locking locking/ww_mutex/Documentation: Update the design document 2017-01-14 11:14:55 +01:00
m68k docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
md MD: add doc for raid5-cache 2017-02-13 09:17:54 -08:00
media A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
memory-devices
metag
mic samples: move mic/mpssd example code from Documentation 2016-09-20 12:38:48 -06:00
mips
misc-devices samples: move misc-devices/mei example code from Documentation 2016-09-23 11:51:43 -06:00
mmc mmc: core: Extend sysfs with DSR register 2016-07-25 10:34:51 +02:00
mn10300
mtd spi-nor: Add support for Intel SPI serial flash controller 2017-01-03 17:33:36 +00:00
namespaces
netlabel
networking scripts/spelling.txt: add "an user" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
nfc
nios2
nvdimm libnvdimm, btt: update the usage section in Documentation 2016-06-17 16:23:23 -07:00
nvmem
parisc
PCI A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
pcmcia tools: move pcmcia crc32hash tool from Documentation 2016-09-23 13:07:27 -06:00
perf perf: add qcom l2 cache perf events driver 2017-02-08 19:32:24 +00:00
phy
platform
power More power management updates for v4.11-rc1 2017-03-02 17:33:52 -08:00
powerpc powerpc updates for 4.9 2016-10-07 20:19:31 -07:00
pps Doc: clarify source of jitter in USB1.1, and USB2.0 2017-01-04 14:40:52 -07:00
prctl selftests: move prctl tests from Documentation/prctl 2016-09-20 09:09:09 -06:00
process Doc: Correct typo, "Introdution" => "Introduction" 2016-12-01 10:44:08 -07:00
pti
ptp selftests: move ptp tests from Documentation/ptp 2016-09-20 09:54:38 -06:00
rapidio rapidio/documentation/mport_cdev: add missing parameter description 2016-09-01 17:52:02 -07:00
RCU Merge branches 'doc.2017.01.15b', 'dyntick.2017.01.23a', 'fixes.2017.01.23a', 'srcu.2017.01.25a' and 'torture.2017.01.15b' into HEAD 2017-01-25 12:56:05 -08:00
s390 Documentation: Update path to sysrq.txt 2017-03-03 15:48:38 -07:00
scheduler sched/Documentation/sched-rt-group: Fix incorrect example 2017-01-22 10:34:17 +01:00
scsi scripts/spelling.txt: add "varible" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
security KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload() 2017-03-02 10:09:00 +11:00
serial Documentation: rs485: Do not define manually the ioctl 2016-08-18 11:08:33 -06:00
sh
sound scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
sparc Documentation/sparc: Steps for sending break on sunhv console 2017-02-23 08:27:25 -08:00
sphinx docs: sphinx-extensions: make rstFlatTable work with docutils 0.13 2016-12-18 13:30:29 -07:00
sphinx-static This is the documentation update pull for the 4.9 merge window. 2016-10-04 13:54:07 -07:00
spi spi: spi-ep93xx: simplify GPIO chip selects 2017-02-16 20:10:26 +00:00
sysctl A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
target target: make close_session optional 2016-05-10 01:19:26 -07:00
thermal Documentation: fix spelling mistakes of "Celcius" -- > "Celsius" 2017-01-04 14:36:17 -07:00
timers time: Remove CONFIG_TIMER_STATS 2017-02-10 11:15:08 +01:00
trace perf/core: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS 2017-03-01 10:26:39 +01:00
translations doc/ko_KR/memory-barriers: Update control-dependencies section 2017-03-03 15:54:55 -07:00
usb A slightly quieter cycle for documentation this time around. 2017-02-22 18:51:29 -08:00
virtual A few fixes for the docs tree, including one for a 4.11 build regression. 2017-03-04 11:32:18 -08:00
vm userfaultfd: non-cooperative: rollback userfaultfd_exit 2017-03-09 17:01:09 -08:00
w1 w1: add ability to set (SRAM) and store (EEPROM) configuration for temp sensors like DS18B20 2016-05-01 14:37:49 -07:00
watchdog watchdog: Introduce watchdog_stop_on_unregister helper 2017-02-24 14:00:23 -08:00
wimax
x86 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-02-28 11:46:00 -08:00
xtensa xtensa: cleanup MMU setup and kernel layout macros 2016-07-24 06:33:58 +03:00
.gitignore Add .pyc files to .gitignore 2016-06-30 13:07:33 -06:00
00-INDEX Documentation: move MD related doc into a separate dir 2017-02-13 09:17:53 -08:00
bcache.txt bcache: documentation formatting, edited for clarity, stripe alignment notes 2016-06-23 07:58:38 -06:00
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt
cgroup-v2.txt Merge branch 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11 2017-02-02 13:50:35 -05:00
Changes docs: add back 'Documentation/Changes' file (as symlink) 2016-12-14 16:30:12 -08:00
circular-buffers.txt Documentation: circular-buffers: use READ_ONCE() 2016-11-16 16:17:45 -07:00
clk.txt Documentation: clk: update file names containing referenced structures 2016-08-14 12:12:36 -06:00
CodingStyle doc: re-add CodingStyle and SubmittingPatches 2016-10-24 08:12:35 -02:00
conf.py Documentation/sphinx: fix primary_domain configuration 2017-03-03 16:12:30 -07:00
cpu-load.txt
cputopology.txt topology/sysfs: provide drawer id and siblings attributes 2016-06-13 15:58:27 +02:00
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
digsig.txt
DMA-API-HOWTO.txt Documentation: DMA-API-HOWTO: Fix a typo 2016-09-20 17:58:46 -06:00
DMA-API.txt dma-mapping: add dma_{map,unmap}_resource 2016-09-26 22:16:41 +05:30
DMA-attributes.txt common: DMA-mapping: add DMA_ATTR_PRIVILEGED attribute 2017-01-19 15:56:19 +00:00
DMA-ISA-LPC.txt Documentation: DMA-ISA-LPC.txt 2017-02-12 15:20:07 -07:00
docutils.conf doc-rst: add docutils config file 2016-08-14 11:52:40 -06:00
dontdiff Documentation: dontdiff: Update with additional entries 2017-01-26 15:08:06 -07:00
efi-stub.txt
eisa.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcc-plugins.txt GCC plugin infrastructure 2016-06-07 22:57:10 +02:00
highuid.txt
hw_random.txt
hwspinlock.txt
index.rst docs/zh_CN: Add coding-style into docs build system 2017-01-26 15:30:34 -07:00
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt Documentation: Fix a typo in IPMI.txt. 2017-01-05 15:01:54 -06:00
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
isa.txt Documentation: Add ISA bus driver documentation 2016-05-02 09:32:04 -07:00
isapnp.txt
kernel-doc-nano-HOWTO.txt docs-rst: doc-guide: split the kernel-documentation.rst contents 2016-11-19 10:22:04 -07:00
kernel-per-CPU-kthreads.txt docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
kobject.txt
kprobes.txt Documentation: kprobes: Document jprobes stack copying limitations 2016-08-15 10:19:11 -06:00
kref.txt
kselftest.txt scripts/spelling.txt: add "an user" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
ldm.txt
lockup-watchdogs.txt docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
logo.gif
logo.txt
lzo.txt Documentation: lzo: fix spelling mistakes 2016-04-28 07:23:11 -06:00
mailbox.txt
Makefile samples: move blackfin gptimers-example from Documentation 2016-10-10 07:12:02 -06:00
Makefile.sphinx Add a target to check broken external links in the Documentation 2017-02-15 15:22:47 -07:00
memory-barriers.txt doc: Update control-dependencies section of memory-barriers.txt 2017-01-14 21:29:15 -08:00
memory-hotplug.txt scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
men-chameleon-bus.txt
nommu-mmap.txt
ntb.txt
numastat.txt
padata.txt
parport-lowlevel.txt
percpu-rw-semaphore.txt
phy.txt phy: core: Allow children node to be overridden 2016-04-29 16:39:39 +02:00
pi-futex.txt
pinctrl.txt pinctrl: core: Fix regression caused by delayed work for hogs 2017-01-13 16:25:17 +01:00
pnp.txt
preempt-locking.txt
printk-formats.txt
pwm.txt pwm: Update documentation 2016-05-17 14:48:04 +02:00
rbtree.txt
remoteproc.txt remoteproc: Split driver and consumer dereferencing 2016-10-02 22:50:21 -07:00
rfkill.txt docs: fix locations of several documents that got moved 2016-10-24 08:12:35 -02:00
robust-futex-ABI.txt
robust-futexes.txt Documentation: robust-futexes: fix spelling mistakes 2016-04-28 07:26:41 -06:00
rpmsg.txt rpmsg: use module_rpmsg_driver in existing drivers and examples 2016-05-06 11:09:01 -07:00
rtc.txt
SAK.txt
sgi-ioc4.txt
siphash.txt siphash: implement HalfSipHash1-3 for hash tables 2017-01-09 13:58:57 -05:00
SM501.txt
smsc_ece1099.txt
static-keys.txt jump_label: Reduce the size of struct static_key 2017-02-15 09:02:26 -05:00
SubmittingPatches doc: re-add CodingStyle and SubmittingPatches 2016-10-24 08:12:35 -02:00
svga.txt
sync_file.txt dma-buf: Rename struct fence to dma_fence 2016-10-25 14:40:39 +02:00
this_cpu_ops.txt
unaligned-memory-access.txt Documentation/unaligned-memory-access.txt: fix incorrect comparison operator 2016-12-27 13:08:42 -07:00
unshare.txt
vfio-mediated-device.txt vfio-mdev: Make mdev_parent private 2016-12-30 08:13:41 -07:00
vfio.txt
video-output.txt
xillybus.txt Documentation: xillybus: fix spelling mistake 2016-04-28 07:44:54 -06:00
xz.txt
zorro.txt