linux_dsm_epyc7002/Documentation
Hugh Dickins b87537d9e2 mm: rmap use pte lock not mmap_sem to set PageMlocked
KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
found mapped in a VM_LOCKED vma) is ineffective against races with
exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
tearing down an mm.

But that's okay, those races are benign; and although we've believed for
years in that ugly down_read_trylock(), it's unsuitable for the job, and
frustrates the good intention of setting PageMlocked when it fails.

It just doesn't matter if here we read vm_flags an instant before or after
a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
syscalls (or exit) work their way up the address space (taking pt locks
after updating vm_flags) to establish the final state.

We do still need to be careful never to mark a page Mlocked (hence
unevictable) by any race that will not be corrected shortly after.  The
page lock protects from many of the races, but not all (a page is not
necessarily locked when it's unmapped).  But the pte lock we just dropped
is good to cover the rest (and serializes even with
munlock_vma_pages_all(), so no special barriers required): now hold on to
the pte lock while calling mlock_vma_page().  Is that lock ordering safe?
Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
calls the complementary clear_page_mlock().

This fixes the following case (though not a case which anyone has
complained of), which mmap_sem did not: truncation's preliminary
unmap_mapping_range() is supposed to remove even the anonymous COWs of
filecache pages, and that might race with try_to_unmap_one() on a
VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
freed.  The pte lock protects against this.

You could say that it also protects against the more ordinary case, racing
with the preliminary unmapping of a filecache page itself: but in our
current tree, that's independently protected by i_mmap_rwsem; and that
race would be why "Bad page state (mlocked)" was seen before commit
48ec833b78 ("Revert mm/memory.c: share the i_mmap_rwsem").

Vlastimil Babka points out another race which this patch protects against.
 try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
leaving PageMlocked and unevictable when it should be evictable.  mmap_sem
is ineffective because exit_mmap() does not hold it; page lock ineffective
because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
is effective because __munlock_pagevec_fill() takes it to get the page,
after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.

Kirill Shutemov points out that if the compiler chooses to implement a
"vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
with an intermediate store of unrelated bits set, since I'm here foregoing
its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
a spurious VM_LOCKED in vm_flags, and make the wrong decision.  This does
not appear to be an immediate problem, but we may want to define vm_flags
accessors in future, to guard against such a possibility.

While we're here, make a related optimization in try_to_munmap_one(): if
it's doing TTU_MUNLOCK, then there's no point at all in descending the
page tables and getting the pt lock, unless the vma is VM_LOCKED.  Yes,
that can change racily, but it can change racily even without the
optimization: it's not critical.  Far better not to waste time here.

Stopped short of separating try_to_munlock_one() from try_to_munmap_one()
on this occasion, but that's probably the sensible next step - with a
rename, given that try_to_munlock()'s business is to try to set Mlocked.

Updated the unevictable-lru Documentation, to remove its reference to mmap
semaphore, but found a few more updates needed in just that area.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
..
ABI char/misc drivers for 4.4-rc1 2015-11-04 22:15:15 -08:00
accounting Documentation: use subdir-y to avoid unnecessary built-in.o files 2014-09-26 11:02:55 +02:00
acpi ACPI / Documentation: Update method tracing documentation. 2015-08-07 02:42:22 +02:00
aoe
arm arm64 updates for 4.4: 2015-11-04 14:47:13 -08:00
arm64 arm64 updates for 4.4: 2015-11-04 14:47:13 -08:00
auxdisplay Documentation: use subdir-y to avoid unnecessary built-in.o files 2014-09-26 11:02:55 +02:00
backlight
blackfin Docs: blackfin: Use new switch macro SAMPLE_IRQ_TIMER instead of IRQ_TIMER5 2015-05-07 09:35:14 -06:00
block block: add an API for Persistent Reservations 2015-10-21 14:46:56 -06:00
blockdev zsmalloc: account the number of compacted pages 2015-09-08 15:35:28 -07:00
bus-devices
cdrom
cgroups Merge branch 'for-4.3/blkcg' of git://git.kernel.dk/linux-block 2015-09-10 18:56:14 -07:00
cma cma: debug: document new debugfs interface 2015-04-14 16:49:00 -07:00
connector
console
cpu-freq cpufreq: remove redundant CPUFREQ_INCOMPATIBLE notifier event 2015-09-01 15:50:38 +02:00
cpuidle
cris
crypto crypto: doc - AEAD / RNG AF_ALG interface 2015-03-09 21:06:18 +11:00
development-process Documentation: remove outdated references to the linux-next wiki 2014-10-28 09:06:11 -04:00
device-mapper - Revert a dm-multipath change that caused a regression for unprivledged 2015-11-04 21:19:53 -08:00
devicetree char/misc drivers for 4.4-rc1 2015-11-04 22:15:15 -08:00
dmaengine Documentation: dmaengine: Add DMA_CTRL_REUSE documentation 2015-08-17 13:46:22 +05:30
DocBook Staging driver update for 4.4-rc1 2015-11-04 21:40:53 -08:00
driver-model driver-core: platform: Provide helpers for multi-driver modules 2015-10-05 05:02:40 +01:00
dvb [media] get_dvb_firmware: Update firmware of ITEtech IT9135 2014-09-21 17:03:04 -03:00
early-userspace
EDID
extcon
fault-injection futex: Fault/error injection capabilities 2015-07-20 11:45:45 +02:00
fb Documentation/fb: add documentation for sm712fb 2015-08-07 15:05:01 -07:00
features arm64 updates for 4.4: 2015-11-04 14:47:13 -08:00
filesystems mm Documentation: undoc non-linear vmas 2015-11-05 19:34:48 -08:00
firmware_class
fmc
fpga usage documentation for FPGA manager core 2015-10-07 18:07:20 +01:00
frv
gpio gpio: add a real time compliance notes 2015-10-27 11:13:42 +01:00
hid HID: sensor: Update document for custom sensor 2015-04-10 22:22:56 +02:00
hwmon hwmon: (lm75) Add support for TMP75C 2015-10-14 07:57:14 -07:00
i2c i2c: support 10 bit and slave addresses in sysfs 'new_device' 2015-08-24 14:05:15 +02:00
ia64 virtual: Documentation: simplify and generalize paravirt_ops.txt 2015-02-13 17:15:44 +10:30
ide
infiniband IB/hfi1: add driver files 2015-08-28 22:59:36 -04:00
input Input: fix typo in MT documentation 2015-09-19 11:39:02 -07:00
ioctl char/misc drivers for 4.4-rc1 2015-11-04 22:15:15 -08:00
isdn
ja_JP Doc: ja_JP: Fix typo in HOWTO 2015-06-08 16:43:09 -06:00
kbuild modsign: Allow password to be specified for signing key 2015-08-07 16:26:14 +01:00
kdump kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
ko_KR
laptops Move freefall program from Documentation/ to tools/ 2015-06-08 16:42:07 -06:00
leds Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds 2015-07-01 19:09:11 -07:00
locking Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-11-03 16:10:43 -08:00
m68k
memory-devices
metag
mic misc: mic: Update MIC host daemon with COSM changes 2015-10-04 12:54:54 +01:00
mips
misc-devices mei: add async event notification ioctls 2015-08-03 17:30:00 -07:00
mmc mmc: core: Remove MMC_CLKGATE 2015-10-26 16:00:09 +01:00
mn10300
mtd
namespaces
netlabel
networking Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2015-10-30 20:51:56 +09:00
nfc NFC: Fix typo in nfc-hci.txt 2015-06-08 23:15:45 +02:00
nios2 Documentation: Add documentation for Nios2 architecture 2014-12-08 12:56:06 +08:00
nvdimm libnvdimm: Non-Volatile Devices 2015-06-26 11:23:38 -04:00
nvmem Documentation: nvmem: add nvmem api level and how-to doc 2015-08-05 13:43:45 -07:00
parisc
PCI The documentation tree update for 4.1. Numerous fixes, the overdue removal 2015-04-18 11:10:49 -04:00
pcmcia pcmcia: Fix typo in locking documentation 2015-08-07 14:34:58 +02:00
phy
platform
power PCI / PM: Update runtime PM documentation for PCI devices 2015-09-25 02:48:44 +02:00
powerpc SCSI misc on 20150901 2015-09-02 12:22:54 -07:00
pps Doc: pps: Fix file name in pps.txt 2015-07-14 12:35:42 -06:00
prctl Documentation/prctl: don't build tsc tests when cross compiling 2015-06-22 16:05:04 -06:00
pti
ptp testptp: Silence compiler warnings on ppc64 2015-09-29 21:16:56 -07:00
rapidio
RCU Merge branches 'doc.2015.10.06a', 'percpu-rwsem.2015.10.06a' and 'torture.2015.10.06a' into HEAD 2015-10-07 16:06:25 -07:00
s390 KVM: s390: remove outdated documentation 2015-07-29 11:02:35 +02:00
scheduler sched/dl/Documentation: Split Section 3 2015-05-19 08:39:21 +02:00
scsi Merge branch 'for-4.2/sg' of git://git.kernel.dk/linux-block 2015-06-25 15:22:36 -07:00
security Merge branch 'smack-for-4.3' of https://github.com/cschaufler/smack-next into next 2015-08-11 11:18:53 +10:00
serial Documentation: improve line discipline method descriptions 2015-10-05 04:53:26 +01:00
sh
sound Doc: sound:oss: Fix typo in sound/oss 2015-06-09 17:23:00 +02:00
spi Documentation/spi/spidev_test.c: fix warning 2015-04-17 09:04:12 -04:00
sysctl kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup 2015-11-05 19:34:48 -08:00
target Documentation/target: Fix tcm_mod_builder.py build breakage 2015-07-24 17:48:55 -07:00
thermal thermal: power_allocator: relax the requirement of two passive trip points 2015-09-14 07:41:45 -07:00
timers documentation: Update NO_HZ_FULL interaction with POSIX timers 2015-02-26 11:57:29 -08:00
tpm
trace intel_th: Add driver infrastructure for Intel(R) Trace Hub devices 2015-10-04 20:28:58 +01:00
usb usb: interface authorization: Documentation part 2015-09-22 12:08:40 -07:00
vDSO Documentation/vDSO: don't build tests when cross compiling 2015-06-22 16:04:57 -06:00
video4linux [media] Documentation: update cardlists 2015-06-09 18:43:47 -03:00
virtual Patch queue for ppc - 2015-08-22 2015-08-22 14:57:59 -07:00
vm mm: rmap use pte lock not mmap_sem to set PageMlocked 2015-11-05 19:34:48 -08:00
w1 w1: masters: omap_hdq: add support for 1-wire mode 2015-10-05 04:47:09 +01:00
watchdog Documentation/watchdog: add timeout and ping rate control to watchdog-test.c 2015-09-09 21:33:36 +02:00
wimax
x86 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-09-01 10:07:40 -07:00
xtensa
zh_CN Documentation updates for 4.2 2015-06-24 20:01:36 -07:00
00-INDEX Update of Documentation/dmaengine/00-INDEX 2014-12-29 15:28:24 -07:00
adding-syscalls.txt Documentation: describe how to add a system call 2015-08-13 17:54:06 -06:00
applying-patches.txt Documentation: change "&" to "and" in Documentation/applying-patches.txt 2014-09-26 11:10:11 +02:00
assoc_array.txt
atomic_ops.txt locking/atomics, cmpxchg: Privatize the inclusion of asm/cmpxchg.h 2015-09-13 10:35:46 +02:00
bad_memory.txt
basic_profiling.txt
bcache.txt
binfmt_misc.txt binfmt_misc: touch up documentation a bit 2014-10-14 02:18:16 +02:00
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
BUG-HUNTING
bus-virt-phys-mapping.txt
cachetlb.txt rmap: drop support of non-linear mappings 2015-02-10 14:30:31 -08:00
Changes MODSIGN: Change from CMS to PKCS#7 signing if the openssl is too old 2015-09-25 16:31:46 +01:00
circular-buffers.txt
clk.txt clk: change clk_ops' ->determine_rate() prototype 2015-07-27 18:12:01 -07:00
coccinelle.txt
CodeOfConflict Code of Conflict 2015-02-27 11:44:24 -08:00
CodingStyle Documentation: CodingStyle: remove broken links in the References section 2015-07-10 13:54:34 -06:00
cpu-hotplug.txt cpumask: fix cpu-hotplug documentation 2015-03-05 13:37:01 +10:30
cpu-load.txt
cputopology.txt Documentation: Update cputopology.txt 2015-05-27 15:22:15 +02:00
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt Doc: Change wikipedia's URL from http to https 2015-06-22 10:14:05 -06:00
dell_rbu.txt
devices.txt
digsig.txt
DMA-API-HOWTO.txt Documentation updates for 4.2 2015-06-24 20:01:36 -07:00
DMA-API.txt mm: add dma_pool_zalloc() call to DMA API 2015-09-08 15:35:28 -07:00
DMA-attributes.txt
dma-buf-sharing.txt dma-buf: cleanup dma_buf_export() to make it easily extensible 2015-04-21 14:47:16 +05:30
DMA-ISA-LPC.txt
dontdiff
dynamic-debug-howto.txt
edac.txt Documentation/EDAC: Add reference documents section for amd64_edac 2015-09-29 13:42:41 +02:00
efi-stub.txt
eisa.txt
email-clients.txt fix Evolution submenu name in email-clients.txt 2015-07-24 15:05:57 +02:00
flexible-arrays.txt
futex-requeue-pi.txt doc: Fix misnamed FUTEX_CMP_REQUEUE_PI op constants 2015-01-19 12:05:32 +01:00
gcov.txt
gdb-kernel-debugging.txt scripts/gdb: add basic documentation 2015-02-17 14:34:54 -08:00
highuid.txt
HOWTO docs: update HOWTO for 3.x -> 4.x versioning 2015-08-24 11:28:17 -06:00
hsi.txt
hw_random.txt hwrng: doc - Fix device node name reference /dev/hw_random => /dev/hwrng 2015-09-21 22:00:41 +08:00
hwspinlock.txt hwspinlock/core: add device tree support 2015-05-02 09:54:30 +03:00
init.txt
initrd.txt
intel_txt.txt
Intel-IOMMU.txt x86/vt-d: Fix documentation of DRHD 2015-08-25 10:44:49 +02:00
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt ipmi:ssif: Ignore spaces when comparing I2C adapter names 2015-05-05 14:24:45 -05:00
IRQ-affinity.txt
IRQ-domain.txt irqdomain: Documentation updates 2015-10-13 19:01:25 +02:00
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kasan.txt Merge branch 'doc/4.2' into docs-next 2015-06-08 17:04:11 -06:00
kernel-doc-nano-HOWTO.txt Documenation: Update location of docproc.c 2015-07-14 12:36:39 -06:00
kernel-docs.txt
kernel-parameters.txt kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup 2015-11-05 19:34:48 -08:00
kernel-per-CPU-kthreads.txt documentation: Update per-CPU kthreads documentation 2015-02-26 11:57:30 -08:00
kmemcheck.txt Documentation: update the CONFIG_DEBUG_PAGEALLOC description 2015-03-20 07:41:55 -06:00
kmemleak.txt Doc: Change wikipedia's URL from http to https 2015-06-22 10:14:05 -06:00
kobject.txt kobject: grammar fix 2014-12-08 09:07:11 -05:00
kprobes.txt kprobes: Update Documentation/kprobes.txt 2015-03-20 07:41:55 -06:00
kref.txt
kselftest.txt kselftest: Move the docs to the Documentation dir 2014-11-24 10:49:54 -07:00
ldm.txt
local_ops.txt percpu: update local_ops.txt to reflect this_cpu operations 2014-12-13 12:42:53 -08:00
lockup-watchdogs.txt kernel/watchdog.c: add sysctl knob hardlockup_panic 2015-11-05 19:34:48 -08:00
logo.gif
logo.txt
lzo.txt Documentation: lzo: document part of the encoding 2014-09-28 11:08:00 +02:00
magic-number.txt Documentation/magic-number: Remove SCC_MAGIC 2015-05-13 15:39:04 -04:00
mailbox.txt Documentation: minor typo fix in mailbox.txt 2015-08-13 18:03:18 -06:00
Makefile Documentation: Remove ZBOOT MMC/SDHI utility and docs 2015-02-24 06:45:25 +09:00
ManagementStyle
md-cluster.txt md-cluster: fix deadlock issue on message lock 2015-08-31 19:41:41 +02:00
md.txt doc:md: fix typo in md.txt. 2015-06-23 06:49:44 -06:00
media-framework.txt
memory-barriers.txt atomic: remove all traces of READ_ONCE_CTRL() and atomic*_read_ctrl() 2015-11-03 17:22:17 -08:00
memory-hotplug.txt mem-hotplug: fix typo in Documentation/memory-hotplug.txt 2015-03-20 07:41:55 -06:00
men-chameleon-bus.txt Documentation: Minor changes to men-chameleon-bus.txt 2015-07-24 15:15:17 +02:00
module-signing.txt Move certificate handling to its own directory 2015-08-14 16:06:13 +01:00
mono.txt
nommu-mmap.txt fs: introduce f_op->mmap_capabilities for nommu mmap support 2015-01-20 14:02:58 -07:00
ntb.txt NTB: Rename Intel code names to platform names 2015-07-04 14:09:25 -04:00
numastat.txt
oops-tracing.txt livepatch: kernel: add TAINT_LIVEPATCH 2014-12-22 15:40:48 +01:00
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt
phy.txt phy: core: Add devm_of_phy_get_by_index to phy-core 2015-05-11 21:42:23 +05:30
pi-futex.txt
pinctrl.txt pinctrl: move strict option to pinmux_ops 2015-05-06 14:45:19 +02:00
pnp.txt
preempt-locking.txt x86/fpu: Rename math_state_restore() to fpu__restore() 2015-05-19 15:47:18 +02:00
printk-formats.txt The documentation tree update for 4.1. Numerous fixes, the overdue removal 2015-04-18 11:10:49 -04:00
pwm.txt
ramoops.txt pstore-ram: Allow optional mapping with pgprot_noncached 2014-12-11 13:38:31 -08:00
rbtree.txt
remoteproc.txt remoteproc: introduce rproc_get_by_phandle API 2015-06-16 21:12:52 +03:00
rfkill.txt rfkill: document rfkill module parameters 2015-01-09 23:22:12 +01:00
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rtc.txt Documentation, split up rtc.txt into documentation and test file 2015-03-24 22:01:58 -06:00
SAK.txt
SecurityBugs
serial-console.txt
sgi-ioc4.txt
SM501.txt
smsc_ece1099.txt
sparse.txt
stable_api_nonsense.txt
stable_kernel_rules.txt stable: Update documentation to clarify preferred procedure 2015-05-22 09:38:56 -06:00
static-keys.txt locking/static_keys: Fix up the static keys documentation 2015-09-15 07:12:06 +02:00
SubmitChecklist
SubmittingDrivers
SubmittingPatches Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2015-09-01 18:46:42 -07:00
svga.txt
sysfs-rules.txt
sysrq.txt mm, oom: do not panic for oom kills triggered from sysrq 2015-09-08 15:35:28 -07:00
this_cpu_ops.txt Docs: this_cpu_ops: remove redundant add forms 2014-09-26 11:03:00 +02:00
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt vfio: powerpc/spapr: Support Dynamic DMA windows 2015-06-11 15:16:55 +10:00
VGA-softcursor.txt
vgaarbiter.txt
video-output.txt
vme_api.txt Documentation: mention vme_master_mmap() in VME API 2015-06-12 17:26:56 -07:00
volatile-considered-harmful.txt
workqueue.txt workqueue: fix trivial typo in Documentation/workqueue.txt 2015-05-05 09:50:38 -04:00
xillybus.txt xillybus: Move out of staging 2014-09-23 23:44:16 -07:00
xz.txt
zorro.txt