Add a flag which causes page-types to use the kernels's idle page
tracking to mark pages idle. As the tool already prints the idle flag
if set, subsequent runs will show which pages have been accessed since
last run.
[akpm@linux-foundation.org: simplify mark_page_idle()]
[chansen3@cisco.com: reorganize mark_page_idle() logic, add docs]
Link: http://lkml.kernel.org/r/20180706172237.21691-1-chansen3@cisco.com
Link: http://lkml.kernel.org/r/20180612153223.13174-1-chansen3@cisco.com
Signed-off-by: Christian Hansen <chansen3@cisco.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new flag that will read kpagecount for each PFN and print out the
number of times the page is mapped along with the flags in the listing
view.
This information is useful in understanding and optimizing memory usage.
Identifying pages which are not shared allows us to focus on adjusting
the memory layout or access patterns for the sole owning process.
Knowing the number of processes that share a page tells us how many
other times we must make the same adjustments or how many processes to
potentially disable.
Truncated sample output:
voffset map-cnt offset len flags
561a3591e 1 15fe8 1 ___U_lA____Ma_b___________________________
561a3591f 1 2b103 1 ___U_lA____Ma_b___________________________
561a36ca4 1 2cc78 1 ___U_lA____Ma_b___________________________
7f588bb4e 14 2273c 1 __RU_lA____M______________________________
[akpm@linux-foundation.org: coding-style fixes]
[chansen3@cisco.com: add documentation, tweak whitespace]
Link: http://lkml.kernel.org/r/20180705181204.5529-1-chansen3@cisco.com
Link: http://lkml.kernel.org/r/20180612153205.12879-1-chansen3@cisco.com
Signed-off-by: Christian Hansen <chansen3@cisco.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlt1f9AUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxbdhAArnhRvkwOk4m4/LCuKF6HpmlxbBNC
TjnBCenNf+lFXzWskfDFGFl/Wif4UzGbRTSCNQrwMzj3Ww3f/6R2QIq9rEJvyNC4
VdxQnaBEZSUgN87q5UGqgdjMTo3zFvlFH6fpb5XDiQ5IX/QZeXeYqoB64w+HvKPU
M+IsoOvnA5gb7pMcpchrGUnSfS1e6AqQbbTt6tZflore6YCEA4cH5OnpGx8qiZIp
ut+CMBvQjQB01fHeBc/wGrVte4NwXdONrXqpUb4sHF7HqRNfEh0QVyPhvebBi+k1
kquqoBQfPFTqgcab31VOcQhg70dEx+1qGm5/YBAwmhCpHR/g2gioFXoROsr+iUOe
BtF6LZr+Y8cySuhJnkCrJBqWvvBaKbJLg0KMbI+7p4o9MZpod2u7LS5LFrlRDyKW
3nz3o+b1+v3tCCKVKIhKo0ljolgkweQtR1f6KIHvq93wBODHVQnAOt9NlPfHVyks
ryGBnOhMjoU5hvfexgIWFk9Ph9MEVQSffkI+TeFPO/tyGBfGfQyGtESiXuEaMQaH
FGdZHX2RLkY3pWHOtWeMzRHzOnr2XjpDFcAqL3HBGPdJ30K3Umv3WOgoFe2SaocG
0gaddPjKSwwM4Sa/VP+O5cjGuzi7QnczSDdpYjxIGZzBav32hqx4/rsnLw7bHH8y
XkEme7cYJc8MGsA=
=2Dmn
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
- Decode AER errors with names similar to "lspci" (Tyler Baicar)
- Expose AER statistics in sysfs (Rajat Jain)
- Clear AER status bits selectively based on the type of recovery (Oza
Pawandeep)
- Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
Gagniuc)
- Don't clear AER status bits if we're using the "Firmware-First"
strategy where firmware owns the registers (Alexandru Gagniuc)
- Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
Shevchenko)
- Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas)
- Defer DPC event handling to work queue (Keith Busch)
- Use threaded IRQ for DPC bottom half (Keith Busch)
- Print AER status while handling DPC events (Keith Busch)
- Work around IDT switch ACS Source Validation erratum (James
Puthukattukaran)
- Emit diagnostics for all cases of PCIe Link downtraining (Links
operating slower than they're capable of) (Alexandru Gagniuc)
- Skip VFs when configuring Max Payload Size (Myron Stowe)
- Reduce Root Port Max Payload Size if necessary when hot-adding a
device below it (Myron Stowe)
- Simplify SHPC existence/permission checks (Bjorn Helgaas)
- Remove hotplug sample skeleton driver (Lukas Wunner)
- Convert pciehp to threaded IRQ handling (Lukas Wunner)
- Improve pciehp tolerance of missed events and initially unstable
links (Lukas Wunner)
- Clear spurious pciehp events on resume (Lukas Wunner)
- Add pciehp runtime PM support, including for Thunderbolt controllers
(Lukas Wunner)
- Support interrupts from pciehp bridges in D3hot (Lukas Wunner)
- Mark fall-through switch cases before enabling -Wimplicit-fallthrough
(Gustavo A. R. Silva)
- Move DMA-debug PCI init from arch code to PCI core (Christoph
Hellwig)
- Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
supplied (Heiner Kallweit)
- Unify PCI and DMA direction #defines (Shunyong Yang)
- Add PCI_DEVICE_DATA() macro (Andy Shevchenko)
- Check for VPD completion before checking for timeout (Bert Kenward)
- Limit Netronome NFP5000 config space size to work around erratum
(Jakub Kicinski)
- Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)
- Document ACPI description of PCI host bridges (Bjorn Helgaas)
- Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
peer-to-peer DMA support (we don't have the peer-to-peer support yet;
this is just one piece) (Logan Gunthorpe)
- Clean up devm_of_pci_get_host_bridge_resources() resource allocation
(Jan Kiszka)
- Fixup resizable BARs after suspend/resume (Christian König)
- Make "pci=earlydump" generic (Sinan Kaya)
- Fix ROM BAR access routines to stay in bounds and check for signature
correctly (Rex Zhu)
- Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)
- Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)
- To avoid bus errors, enable PASID only if entire path supports
End-End TLP prefixes (Sinan Kaya)
- Unify slot and bus reset functions and remove hotplug knowledge from
callers (Sinan Kaya)
- Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
fix guest reboot issues (Alex Williamson)
- Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
Controller (Bjorn Helgaas)
- Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)
- Remove Aardvark outbound window configuration (Evan Wang)
- Fix Aardvark bridge window sizing issue (Zachary Zhang)
- Convert Aardvark to use pci_host_probe() to reduce code duplication
(Thomas Petazzoni)
- Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)
- Add Cadence support for optional generic PHYs (Alan Douglas)
- Add Cadence power management ops (Alan Douglas)
- Remove redundant variable from Cadence driver (Colin Ian King)
- Add Kirin MSI support (Xiaowei Song)
- Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
Guo)
- Move link notification settings from DesignWare core to individual
drivers (Gustavo Pimentel)
- Add endpoint library MSI-X interfaces (Gustavo Pimentel)
- Correct signature of endpoint library IRQ interfaces (Gustavo
Pimentel)
- Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)
- Add endpoint library MSI-X test support (Gustavo Pimentel)
- Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
(Jia-Ju Bai)
- Add more devices to Broadcom PAXC quirk (Ray Jui)
- Work around corrupted Broadcom PAXC config space to enable SMMU and
GICv3 ITS (Ray Jui)
- Disable MSI parsing to work around broken Broadcom PAXC logic in some
devices (Ray Jui)
- Hide unconfigured functions to work around a Broadcom PAXC defect
(Ray Jui)
- Lower iproc log level to reduce console output during boot (Ray Jui)
- Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)
- Fix mobiveil missing include file (Lorenzo Pieralisi)
- Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)
- Fix mvebu I/O space remapping issues (Thomas Petazzoni)
- Use generic pci_host_bridge in mvebu instead of ARM-specific API
(Thomas Petazzoni)
- Whitelist VMD devices with fast interrupt handlers to avoid sharing
vectors with slow handlers (Keith Busch)
* tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
PCI/AER: Don't clear AER bits if error handling is Firmware-First
PCI: Limit config space size for Netronome NFP5000
PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
PCI/VPD: Check for VPD access completion before checking for timeout
PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
PCI: Match Root Port's MPS to endpoint's MPSS as necessary
PCI: Skip MPS logic for Virtual Functions (VFs)
PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
PCI: Check for PCIe Link downtraining
PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
PCI: Add device-specific ACS Redirect disable infrastructure
PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
PCI: Allow specifying devices using a base bus and path of devfns
PCI: Make specifying PCI devices in kernel parameters reusable
PCI: Hide ACS quirk declarations inside PCI core
PCI: Delay after FLR of Intel DC P3700 NVMe
PCI: Disable Samsung SM961/PM961 NVMe before FLR
PCI: Export pcie_has_flr()
PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
...
initializing hashed pointers and (optionally, controlled by a config
option) to initialize the CRNG to avoid boot hangs.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAltyBEkACgkQ8vlZVpUN
gaNZ6wgAhNyYLr2c0V0UnQyvguZXcJLBerqqGh9XvG//66kXUvYfT0NJSd2i7DZ/
u4ypf9NxfG4/emg2DDy3r+K/UjhgCIKKjzfp2MzYeEptJGg9V9EV7v1YtFJYs39g
cPmFv1l7fPNqe3qXXsbuZe2pSnJfEfzHeOStDNrEX1CJStt+LC7HRz1/dIcgycOa
CsB3yILQpgxu9HcVCfIeDtxjly7GQYTJKQGLAe/8MdatZ96HW/E4obvnDZhuFtCH
54OumcKhFXiODFLpBsK3Bllk2v9fO1Gq/SuYmNA85mXqbZVAUV2YNZK2HWASXwkB
NxwRcfLywgqfYmtvpp63rHSjJB76AQ==
=l9HN
-----END PGP SIGNATURE-----
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random updates from Ted Ts'o:
"Some changes to trust cpu-based hwrng (such as RDRAND) for
initializing hashed pointers and (optionally, controlled by a config
option) to initialize the CRNG to avoid boot hangs"
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: Make crng state queryable
random: remove preempt disabled region
random: add a config option to trust the CPU's hwrng
vsprintf: Add command line option debug_boot_weak_hash
vsprintf: Use hw RNG for ptr_key
random: Return nbytes filled from hw RNG
random: Fix whitespace pre random-bytes work
- Clean up devm_of_pci_get_host_bridge_resources() resource allocation
(Jan Kiszka)
- Fixup resizable BARs after suspend/resume (Christian König)
- Make "pci=earlydump" generic (Sinan Kaya)
- Fix ROM BAR access routines to stay in bounds and check for signature
correctly (Rex Zhu)
* pci/resource:
PCI: Make pci_get_rom_size() static
PCI: Add check code for last image indicator not set
PCI: Avoid accessing memory outside the ROM BAR
PCI: Make early dump functionality generic
PCI: Cleanup PCI_REBAR_CTRL_BAR_SHIFT handling
PCI: Restore resized BAR state on resume
PCI: Clean up resource allocation in devm_of_pci_get_host_bridge_resources()
# Conflicts:
# Documentation/admin-guide/kernel-parameters.txt
- add "hardened_usercopy=off" rare performance needs (Chris von Recklinghausen)
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltx6hAWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJrrsEAChFhTgko1nNKYhks9KIIMZ7YWc
bCWpXMnBkmTbPa192a/4aDvvwuor5EFDavWY+vEciOvT2iY6h6uus/BzKB5JlHZ9
QsZS2uLr6SJX76Ri2r8alWT0hWovp/tFopXfnFt4fOHgSK+6rcWJRFzFefsZkcYd
xNEw2HnS0kYpgw0aEe3BsnsEn6u0/CxzyGTv6OLcnXU5riOkFUqm8ehLSA44aJW4
cfqWmdelfhvs0thR0rJItUUUmhVM3i6Zccvv0HCt6z8Xz9LIZgyxnnD9Ac7mGz8y
WjNPipLqXhu8/JVsd0Y6GK6b8bYh8uNID20fgr/6aWDZkOvUHe54/ChCkjs7cW6F
JWGn1hS1tg75rdw09tr4POVw4tUIe1JcqCfsJ7IzXA7oc6PsXzlGl8USDtK9f/fK
ryC60NQKo1dXGlY+18i1iw7HsMuWbtaIiWf8Zudy7JethDn3RbHshyF5tGpx0nFB
/qRTtMaC5WqIfZAbVb1Qou71gJzmS+k/RjltCO0AnhZrvFr0Qq3eQKRTkGhzOKRq
1dvOHb9ScNeehlQeaC+k0mm8ANf16gzXSGmGg3Z/7LfECbCqc7R7B767dN52hx2X
48P5cDNKUuXgHNk+p20Yr5m16oJDkAOxSHvFN9Kizy/eL7RbgOZREQcB4an9S+A0
yb6uQKU9CQ3n/NSZyA==
=j2xG
-----END PGP SIGNATURE-----
Merge tag 'hardened-usercopy-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy updates from Kees Cook:
"This cleans up a minor Kconfig issue and adds a kernel boot option for
disabling hardened usercopy for distro users that may have corner-case
performance issues (e.g. high bandwidth small-packet UDP traffic).
Summary:
- drop unneeded Kconfig "select BUG" (Kamal Mostafa)
- add "hardened_usercopy=off" rare performance needs (Chris von
Recklinghausen)"
* tag 'hardened-usercopy-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usercopy: Allow boot cmdline disabling of hardening
usercopy: Do not select BUG with HARDENED_USERCOPY
small fixes and updates. We also have new ktime_get_*() docs from Arnd,
some kernel-doc fixes, a new set of Italian translations (non so se vale la
pena, ma non fa male - speriamo bene), and some extensive early
memory-management documentation improvements from Mike Rapoport.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbcZVtAAoJEI3ONVYwIuV64ekP/jgTlMi/fErRu6zlsrsWgiIK
ir8ueCQ1OSiwOA+N2fUb+2+zlLlfTLgQ+o5IwmZk6rizG87fQ3Rp6i+bvYAZWITh
YUuls3VhtRlJZqG1EW7gww1Q2IhRO6GhcpsIamAvhrSLFPaCKiN3JomJi/X47Pfj
Ibl24HsruI2fDM1JwWRwCtE5J6vCL9lH1/5v4zVv7xdrVgTrwkZ/hAsE7HBNNat5
dSku2u9HSAXa4KR4sLWrVJ8UI5+fylwilz/57HhCeduQDwKCHE/mfhxLdqL4Oa4q
oHTCNq2zTUj4w7GTvHS1g0P3y/iWMYjAzH2is+BokilpIC65NwwsKx2ybZd3Srdh
zwP/kYk5U+mYSgdDlyNqwPCibw8KDXB3srKMzyQSN6tkosKCOHFSXF0Js0eupi7t
NqmGigl3Qozj1uvU6Wy7vh58u+GFeuO4PF566t2m70Jp0cWzuVKLrBvgNO1X37rB
aEBrpOYB/H54t/qf79IFW//pptWXFNZ3S9AgyDVIcmX5C2ihaCoaPNRTom+KbH/D
QEoH9rwWSoCi2DGoR83D+G8thCUfB4yfEGulSSIA4pUR7qvIR5rd1ZioI/qtgAHm
l7MjTbLpPwiMnpFkBrxxxlFFb4gbETakMBGYoYee8ww5WbQLu0qA93AbwIXyjhE8
mqCOLyBdCAZ0mNxqPSsc
=x/P0
-----END PGP SIGNATURE-----
Merge tag 'docs-4.19' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"This was a moderately busy cycle for docs, with the usual collection
of small fixes and updates.
We also have new ktime_get_*() docs from Arnd, some kernel-doc fixes,
a new set of Italian translations (non so se vale la pena, ma non fa
male - speriamo bene), and some extensive early memory-management
documentation improvements from Mike Rapoport"
* tag 'docs-4.19' of git://git.lwn.net/linux: (52 commits)
Documentation: corrections to console/console.txt
Documentation: add ioctl number entry for v4l2-subdev.h
Remove gendered language from management style documentation
scripts/kernel-doc: Escape all literal braces in regexes
docs/mm: add description of boot time memory management
docs/mm: memblock: add overview documentation
docs/mm: memblock: add kernel-doc description for memblock types
docs/mm: memblock: add kernel-doc comments for memblock_add[_node]
docs/mm: memblock: update kernel-doc comments
mm/memblock: add a name for memblock flags enumeration
docs/mm: bootmem: add overview documentation
docs/mm: bootmem: add kernel-doc description of 'struct bootmem_data'
docs/mm: bootmem: fix kernel-doc warnings
docs/mm: nobootmem: fixup kernel-doc comments
mm/bootmem: drop duplicated kernel-doc comments
Documentation: vm.txt: Adding 'nr_hugepages_mempolicy' parameter description.
doc:it_IT: translation for kernel-hacking
docs: Fix the reference labels in Locking.rst
doc: tracing: Fix a typo of trace_stat
mm: Introduce new type vm_fault_t
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
Zolos5H+JA==
=b9ib
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
Pull x86 timer updates from Thomas Gleixner:
"Early TSC based time stamping to allow better boot time analysis.
This comes with a general cleanup of the TSC calibration code which
grew warts and duct taping over the years and removes 250 lines of
code. Initiated and mostly implemented by Pavel with help from various
folks"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/kvmclock: Mark kvm_get_preset_lpj() as __init
x86/tsc: Consolidate init code
sched/clock: Disable interrupts when calling generic_sched_clock_init()
timekeeping: Prevent false warning when persistent clock is not available
sched/clock: Close a hole in sched_clock_init()
x86/tsc: Make use of tsc_calibrate_cpu_early()
x86/tsc: Split native_calibrate_cpu() into early and late parts
sched/clock: Use static key for sched_clock_running
sched/clock: Enable sched clock early
sched/clock: Move sched clock initialization and merge with generic clock
x86/tsc: Use TSC as sched clock early
x86/tsc: Initialize cyc2ns when tsc frequency is determined
x86/tsc: Calibrate tsc only once
ARM/time: Remove read_boot_clock64()
s390/time: Remove read_boot_clock64()
timekeeping: Default boot time offset to local_clock()
timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
s390/time: Add read_persistent_wall_and_boot_offset()
x86/xen/time: Output xen sched_clock time from 0
x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
...
To support peer-to-peer traffic on a segment of the PCI hierarchy, we must
disable the ACS redirect bits for select PCI bridges. The bridges must be
selected before the devices are discovered by the kernel and the IOMMU
groups created. Therefore, add a kernel command line parameter to specify
devices which must have their ACS bits disabled.
The new parameter takes a list of devices separated by a semicolon. Each
device specified will have its ACS redirect bits disabled. This is
similar to the existing 'resource_alignment' parameter.
The ACS Request P2P Request Redirect, P2P Completion Redirect and P2P
Egress Control bits are disabled, which is sufficient to always allow
passing P2P traffic uninterrupted. The bits are set after the kernel
(optionally) enables the ACS bits itself. It is also done regardless of
whether the kernel or platform firmware sets the bits.
If the user tries to disable the ACS redirect for a device without the ACS
capability, print a warning to dmesg.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: reorder to add the generic code first and move the
device-specific quirk to subsequent patches]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
When specifying PCI devices on the kernel command line using a
bus/device/function address, bus numbers can change when adding or
replacing a device, changing motherboard firmware, or applying kernel
parameters like "pci=assign-buses". When bus numbers change, it's likely
the command line tweak will be applied to the wrong device.
Therefore, it is useful to be able to specify devices with a base bus
number and the path of devfns needed to get to it, similar to the "device
scope" structure in the Intel VT-d spec, Section 8.3.1.
Thus, we add an option to specify devices in the following format:
[<domain>:]<bus>:<device>.<func>[/<device>.<func>]*
The path can be any segment within the PCI hierarchy of any length and
determined through the use of 'lspci -t'. When specified this way, it is
less likely that a renumbered bus will result in a valid device
specification and the tweak won't be applied to the wrong device.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: use "device" instead of "slot" in documentation since that's the
usual language in the PCI specs]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Separate out the code to match a PCI device with a string (typically
originating from a kernel parameter) from the
pci_specified_resource_alignment() function into its own helper function.
While we are at it, this change fixes the kernel style of the function
(fixing a number of long lines and extra parentheses).
Additionally, make the analogous change to the kernel parameter
documentation: Separate the description of how to specify a PCI device
into its own section at the head of the "pci=" parameter.
This patch should have no functional alterations.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: use "device" instead of "slot" in documentation since that's the
usual language in the PCI specs]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.
Signed-of-by: Jens Axboe <axboe@kernel.dk>
When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry. Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
Add the relevant Documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Dave reported, that it's not confirmed that Yonah processors are
unaffected. Remove them from the list.
Reported-by: ave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently, avg_lat is calculated by accumulating the mean of every
window in a long running cumulative average. As time goes on, the metric
becomes less and less useful due to the accumulated history.
This patch reuses the same calculation done in load averages to make the
avg_lat metric more lively. Unlike load averages, the avg only advances
when a window elapses (due to an io). Idle periods extend the most
recent window. Bucketing is used to limit the history of avg_lat by
binding it to the window size. So, the window range for 1/exp (decay
rate) is [1 min, 2.5 min) when windows elapse immediately.
The current sample window size is exposed in the debug info to enable
calculation of the window range.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat. Two
fields, dbytes and dios, to respectively count the total bytes and
number of discards are added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently printing [hashed] pointers requires enough entropy to be
available. Early in the boot sequence this may not be the case
resulting in a dummy string '(____ptrval____)' being printed. This
makes debugging the early boot sequence difficult. We can relax the
requirement to use cryptographically secure hashing during debugging.
This enables debugging while keeping development/production kernel
behaviour the same.
If new command line option debug_boot_weak_hash is enabled use
cryptographically insecure hashing and hash pointer value immediately.
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull RCU updates from Paul E. McKenney:
- An optimization and a fix for RCU expedited grace periods, with
the fix being from Boqun Feng.
- Miscellaneous fixes, including a lockdep-annotation fix from
Boqun Feng.
- SRCU updates.
- Updates to rcutorture and associated scripting.
- Introduce grace-period sequence numbers to the RCU-bh, RCU-preempt,
and RCU-sched flavors, replacing the old ->gpnum and ->completed
pair of fields. This change allows lockless code to obtain the
complete grace-period state with a single READ_ONCE(), which is
needed to maintain tolerable lock contention during the upcoming
consolidation of the three RCU flavors. Note that grace-period
sequence numbers are already used by rcu_barrier(), expedited
RCU grace periods, and SRCU, and are thus already heavily used
and well-tested. Joel Fernandes contributed a number of excellent
fixes and improvements.
- Clean up some grace-period-reporting loose ends, including
improving the handling of quiescent states from offline CPUs
and fixing some false-positive WARN_ON_ONCE() invocations.
(Strictly speaking, the WARN_ON_ONCE() invocations were quite
correct, but their invariants were (harmlessly) violated by the
earlier sloppy handling of quiescent states from offline CPUs.)
In addition, improve grace-period forward-progress guarantees so
as to allow removal of fail-safe checks that required otherwise
needless lock acquisitions. Finally, add more diagnostics to
help debug the upcoming consolidation of the RCU-bh, RCU-preempt,
and RCU-sched flavors.
- Additional miscellaneous fixes, including those contributed by
Byungchul Park, Mauro Carvalho Chehab, Joe Perches, Joel Fernandes,
Steven Rostedt, Andrea Parri, and Neil Brown.
- Additional torture-test changes, including several contributed by
Arnd Bergmann and Joel Fernandes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add documentation for the L1TF vulnerability and the mitigation mechanisms:
- Explain the problem and risks
- Document the mitigation mechanisms
- Document the command line controls
- Document the sysfs files
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180713142323.287429944@linutronix.de
Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.
The possible values are:
full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.
flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.
flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.
off
Disables hypervisor mitigations and doesn't emit any warnings.
Default is 'flush'.
Let KVM adhere to these semantics, which means:
- 'lt1f=full,force' : Performe L1D flushes. No runtime control
possible.
- 'l1tf=full'
- 'l1tf-flush'
- 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if
SMT has been runtime enabled or L1D flushing
has been run-time enabled
- 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.
- 'l1tf=off' : L1D flushes are not performed and no warnings
are emitted.
KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when lt1f=full,force is set.
This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.
Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
Some RCU bugs have been sensitive to the frequency of CPU-hotplug
operations, which have been gradually increased over time. But this
frequency is now at the one-second lower limit that can be specified using
the rcutorture.onoff_interval kernel parameter. This commit therefore
changes the units of rcutorture.onoff_interval from seconds to jiffies,
and also sets the value specified for this kernel parameter in the TREE03
rcutorture scenario to 200, which is 200 milliseconds for HZ=1000.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Document the support for spec_store_bypass_disable that was added for
powerpc in commit a048a07d7f ("powerpc/64s: Add support for a store
forwarding barrier at kernel entry/exit").
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This parameter introduced several years ago in the XHCI host controller
driver was somehow left undocumented. Add a few lines in the kernel
parameters text.
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
A basic documentation to describe the interface, statistics, and
behavior of io.latency.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This parameter introduced several years ago in the XHCI host controller
driver was somehow left undocumented. Add a few lines in the kernel
parameters text.
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
L1 terminal fault. The valid arguments are:
- "always" L1D cache flush on every VMENTER.
- "cond" Conditional L1D cache flush, explained below
- "never" Disable the L1D cache flush mitigation
"cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
interesting information into L1D which might exploited.
[ tglx: Split out from a larger patch ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the L1TF CPU bug is present we allow the KVM module to be loaded as the
major of users that use Linux and KVM have trusted guests and do not want a
broken setup.
Cloud vendors are the ones that are uncomfortable with CVE 2018-3620 and as
such they are the ones that should set nosmt to one.
Setting 'nosmt' means that the system administrator also needs to disable
SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
parameter, or via the /sys/devices/system/cpu/smt/control. See commit
05736e4ac1 ("cpu/hotplug: Provide knobs to control SMT").
Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guests vCPUs are running
on sibling threads.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Enabling HARDENED_USERCOPY may cause measurable regressions in networking
performance: up to 8% under UDP flood.
I ran a small packet UDP flood using pktgen vs. a host b2b connected. On
the receiver side the UDP packets are processed by a simple user space
process that just reads and drops them:
https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
Not very useful from a functional PoV, but it helps to pin-point
bottlenecks in the networking stack.
When running a kernel with CONFIG_HARDENED_USERCOPY=y, I see a 5-8%
regression in the receive tput, compared to the same kernel without this
option enabled.
With CONFIG_HARDENED_USERCOPY=y, perf shows ~6% of CPU time spent
cumulatively in __check_object_size (~4%) and __virt_addr_valid (~2%).
The call-chain is:
__GI___libc_recvfrom
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
inet_recvmsg
udp_recvmsg
__check_object_size
udp_recvmsg() actually calls copy_to_iter() (inlined) and the latters
calls check_copy_size() (again, inlined).
A generic distro may want to enable HARDENED_USERCOPY in their default
kernel config, but at the same time, such distro may want to be able to
avoid the performance penalties in with the default configuration and
disable the stricter check on a per-boot basis.
This change adds a boot parameter that conditionally disables
HARDENED_USERCOPY via "hardened_usercopy=off".
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAls5Xh8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGGB4H/3zJ73feDP9uUABk
92tGQbc9PEtrkpOhACBUzVIIxMePgfqFZ+xF9KXwL9fzf+PEJrj9ILZlZ2XxNblN
CO1U95+9nFntGF88DK3h5Hmkn/Kc8kwrbzjax9yvVKfIDX6HwtVUh49BuV7yKza8
ntwzu6ONCqSdDoCRubGxAAoZXGPpznG6FgRpsIVCHxv2Pu/YpQ2vdn9+vHandjAR
gvYnBv4Z06OZU65ABIZ4ivr1SExhSxz6yoZAjSUQvBh6bcKDOVTtTQcdndp5Rzm2
nvuEioKuvyamRswp7LBCIfWs1TSEi6qsUrdcVWNEdwkA164VHqBRfyX176g65v3S
a0df6n4=
=aK8y
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc3' into docs-next
-rc1 broke the docs build due to changes in the e100/e1000 drivers; -rc3
got the fixes via the networking tree. Pull in -rc3 so that the docs tree
can actually build the docs again.
Dave Hansen reported, that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.
The reason is that Machine Check Exceptions are broadcasted to siblings and
the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:
Because the logical processors within a physical package are tightly
coupled with respect to shared hardware resources, both logical
processors are notified of machine check errors that occur within a
given physical processor. If machine-check exceptions are enabled when
a fatal error is reported, all the logical processors within a physical
package are dispatched to the machine-check exception handler. If
machine-check exceptions are disabled, the logical processors enter the
shutdown state and assert the IERR# signal. When enabling machine-check
exceptions, the MCE flag in control register CR4 should be set for each
logical processor.
Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.
This thoughtful engineered mechanism also turns the boot process on all
Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:
MCE is enabled on the boot CPU:
[ 0.244017] mce: CPU supports 22 MCE banks
The corresponding sibling #72 boots:
[ 1.008005] .... node #0, CPUs: #72
That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shutdown. At least it's a
known safe state.
It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.
Adjust the nosmt kernel parameter documentation as well.
Reverts: 2207def700 ("x86/apic: Ignore secondary threads if nosmt=force")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Move early dump functionality into common code so that it is available for
all architectures. No need to carry arch-specific reads around as the read
hooks are already initialized by the time pci_setup_device() is getting
called during scan.
Tested-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Add a label to the top of the file to allow cross-referencing.
Currently it's not possible to cross-reference this file from
Documentation/process/howto.rst because of the missing label.
Signed-off-by: Michael Rodin <michael-git@rodin.online>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Document the recently introduced hwp_dynamic_boost sysfs knob
allowing user space to tell intel_pstate to use iowait boosting
in the active mode with HWP enabled (to improve performance).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Fix an incorrect sysfs path in the intel_pstate admin-guide
documentation.
Fixes: 33fc30b470 (cpufreq: intel_pstate: Document the current behavior and user interface)
Reported-by: Pawit Pornkitprasan <p.pawit@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Provide a command line and a sysfs knob to control SMT.
The command line options are:
'nosmt': Enumerate secondary threads, but do not online them
'nosmt=force': Ignore secondary threads completely during enumeration
via MP table and ACPI/MADT.
The sysfs control file has the following states (read/write):
'on': SMT is enabled. Secondary threads can be freely onlined
'off': SMT is disabled. Secondary threads, even if enumerated
cannot be onlined
'forceoff': SMT is permanentely disabled. Writes to the control
file are rejected.
'notsupported': SMT is not supported by the CPU
The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.
The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.
When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.
When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.
When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.
When the control status is 'notsupported' then writes to the control file
are rejected.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
The alsa parameters file was renamed to alsa-configuration.rst.
With regards to OSS, it got retired as a hole by at changeset
727dede0ba ("sound: Retire OSS"). So, it doesn't make sense
to keep mentioning it at kernel-parameters.txt.
Fixes: 727dede0ba ("sound: Retire OSS")
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
As we move stuff around, some doc references are broken. Fix some of
them via this script:
./scripts/documentation-file-ref-check --fix
Manually checked if the produced result is valid, removing a few
false-positives.
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
- add support for mapping secids and using secctxes
- add the ability to get a task's secid
- add support for audit rule filtering
+ Cleanups
- multiple typo fixes
- Convert to use match_string() helper
- update git and wiki locations in AppArmor docs
- improve get_buffers macro by using get_cpu_ptr
- Use an IDR to allocate apparmor secids
+ Bug fixes
- fix '*seclen' is never less than zero
- fix mediation of prlimit
- fix memory leak when deduping profile load
- fix ptrace read check
- fix memory leak of rule on error exit path
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJbIPxYAAoJEAUvNnAY1cPYVOQQAKfVO71Mk1U6zegWk8VJoiRy
/wb3ZjMy9KCE5UWNPp0jyB3qzFpejZizycRwVS2k1l/SjugACxvq1fyZ85bzys10
pb8efsWU/Co4l45PfaHpoqCJYr3+3/PBPwSU9vb8ScEFnb95D+0d7KRgA6uIC7lE
H/zbjot1AXGX0CVKmQkKXdi+Ldnbzqv7GtCzipKWDeD0JJqgOKu8NOnnAfJiSNs7
YlIhcr6K4nRxHJ6e8vxbYeogbBzmVWZwWHN8ViXj5Bbox93FRlkkSqxw8Ke8SmXi
y/wQabMQMPZHr2SvQjvFD3cpBmKaMG9NktIjy/4tYcTbhZPNgx/wJSSzRiySFTiW
hPbXWueI75P3Zepj4rRaXy0T68fQaj4k2lTItxkqGN1UOu8mibMlOkE6ZmllTKO7
xPvLgZL7/vYS0fKqJaikZbMhWTBtQD/w0ZwYzmT77umOgRHQvrGKi9nk49fIigOo
aftf8VIjMBUND2JMWCQn1d33CJUXdONpW0aX6cr5Xxthnlz5+aa9Ki2s58BFMVI3
PSMhOr6kdpxrkemEnoVnFMohxRb+u046ecM5X5E2rMEbH3PHow5bzaXyTBHFAiYY
rPn/sKNaXtw4hdMcnv9lmFKyObAdoBxY4bRKzrPTC66sIMncLYVzcSzWY6C3bMfm
tuu+zmVF0v5JENrcwccQ
=EVj2
-----END PGP SIGNATURE-----
Merge tag 'apparmor-pr-2018-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
Pull AppArmor updates from John Johansen:
"Features
- add support for mapping secids and using secctxes
- add the ability to get a task's secid
- add support for audit rule filtering
Cleanups:
- multiple typo fixes
- Convert to use match_string() helper
- update git and wiki locations in AppArmor docs
- improve get_buffers macro by using get_cpu_ptr
- Use an IDR to allocate apparmor secids
Bug fixes:
- fix '*seclen' is never less than zero
- fix mediation of prlimit
- fix memory leak when deduping profile load
- fix ptrace read check
- fix memory leak of rule on error exit path"
* tag 'apparmor-pr-2018-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor: (21 commits)
apparmor: fix ptrace read check
apparmor: fix memory leak when deduping profile load
apparmor: fix mediation of prlimit
apparmor: fixup secid map conversion to using IDR
apparmor: Use an IDR to allocate apparmor secids
apparmor: Fix memory leak of rule on error exit path
apparmor: modify audit rule support to support profile stacks
apparmor: Add support for audit rule filtering
apparmor: update git and wiki locations in AppArmor docs
apparmor: Convert to use match_string() helper
apparmor: improve get_buffers macro by using get_cpu_ptr
apparmor: fix '*seclen' is never less than zero
apparmor: fix typo "preconfinement"
apparmor: fix typo "independent"
apparmor: fix typo "traverse"
apparmor: fix typo "type"
apparmor: fix typo "replace"
apparmor: fix typo "comparison"
apparmor: fix typo "loosen"
apparmor: add the ability to get a task's secid
...
- Spectre v4 mitigation (Speculative Store Bypass Disable) support for
arm64 using SMC firmware call to set a hardware chicken bit
- ACPI PPTT (Processor Properties Topology Table) parsing support and
enable the feature for arm64
- Report signal frame size to user via auxv (AT_MINSIGSTKSZ). The
primary motivation is Scalable Vector Extensions which requires more
space on the signal frame than the currently defined MINSIGSTKSZ
- ARM perf patches: allow building arm-cci as module, demote dev_warn()
to dev_dbg() in arm-ccn event_init(), miscellaneous cleanups
- cmpwait() WFE optimisation to avoid some spurious wakeups
- L1_CACHE_BYTES reverted back to 64 (for performance reasons that have
to do with some network allocations) while keeping ARCH_DMA_MINALIGN
to 128. cache_line_size() returns the actual hardware Cache Writeback
Granule
- Turn LSE atomics on by default in Kconfig
- Kernel fault reporting tidying
- Some #include and miscellaneous cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAlsaoqsACgkQa9axLQDI
XvH+8RAAqRCrEtkNPS7zxHyMK/D2cxSy9EVtlJ1sxhmsONEe5t5MDTWX9byobQ5A
PAKMSQBQgUvecqHLOtD7SJWef1il30zgWmc/yPcgNv3OsA1Au7j2g3ht/Drw+N5I
Vy0aOUEtw+Jzs7y/CJyl6lufSkkOzszOujt2Nybiz6omztOrwkW9isKnURzQBNj5
gquZI35h604YJ9F0TqS6ZqU7tNcuB9q02FxvVBpLmb83jP4jSEjYACUJwVVxvEAB
UXjdD4N130rRXDS5OMRWo5+4SAj+kPYhdVYEvaDx7xTOIRHhXK05GlJbsUAc5E6l
xy810fH5Dm0diYpVvYWTA5J+BU1jNOvCys5zKWl7gs2P8YB59PdqY4M2YBPNGb5H
PaVgq73TZAsww6ZInbZlK+wZOIxZZIOf//Z+QKn6EPtu3RmzIFWwyttTj01w1E3i
LhjcUoGnvxJFcMoCr59ihDwfP9nkCVrNc4REOGaWDk6L/t/bOfaZfDz+OCGbwQdL
akCFKZI6q5O/no+YfhtdtNFpCQb/Bo1J88KuotICRXq8z4vO41zIG53bi97W8QeG
rCBiX0NxUxYJ3ybus7kZHTmMGieMyEHP28n12QffwvJj4vJBsUXQBrV8hclx0djZ
HMt7iPi/0BW6nVV7ngIgN3cdCpaDCEGRsfO4Ch0rFZrC9UbYQnE=
=uums
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Apart from the core arm64 and perf changes, the Spectre v4 mitigation
touches the arm KVM code and the ACPI PPTT support touches drivers/
(acpi and cacheinfo). I should have the maintainers' acks in place.
Summary:
- Spectre v4 mitigation (Speculative Store Bypass Disable) support
for arm64 using SMC firmware call to set a hardware chicken bit
- ACPI PPTT (Processor Properties Topology Table) parsing support and
enable the feature for arm64
- Report signal frame size to user via auxv (AT_MINSIGSTKSZ). The
primary motivation is Scalable Vector Extensions which requires
more space on the signal frame than the currently defined
MINSIGSTKSZ
- ARM perf patches: allow building arm-cci as module, demote
dev_warn() to dev_dbg() in arm-ccn event_init(), miscellaneous
cleanups
- cmpwait() WFE optimisation to avoid some spurious wakeups
- L1_CACHE_BYTES reverted back to 64 (for performance reasons that
have to do with some network allocations) while keeping
ARCH_DMA_MINALIGN to 128. cache_line_size() returns the actual
hardware Cache Writeback Granule
- Turn LSE atomics on by default in Kconfig
- Kernel fault reporting tidying
- Some #include and miscellaneous cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (53 commits)
arm64: Fix syscall restarting around signal suppressed by tracer
arm64: topology: Avoid checking numa mask for scheduler MC selection
ACPI / PPTT: fix build when CONFIG_ACPI_PPTT is not enabled
arm64: cpu_errata: include required headers
arm64: KVM: Move VCPU_WORKAROUND_2_FLAG macros to the top of the file
arm64: signal: Report signal frame size to userspace via auxv
arm64/sve: Thin out initialisation sanity-checks for sve_max_vl
arm64: KVM: Add ARCH_WORKAROUND_2 discovery through ARCH_FEATURES_FUNC_ID
arm64: KVM: Handle guest's ARCH_WORKAROUND_2 requests
arm64: KVM: Add ARCH_WORKAROUND_2 support for guests
arm64: KVM: Add HYP per-cpu accessors
arm64: ssbd: Add prctl interface for per-thread mitigation
arm64: ssbd: Introduce thread flag to control userspace mitigation
arm64: ssbd: Restore mitigation status on CPU resume
arm64: ssbd: Skip apply_ssbd if not using dynamic mitigation
arm64: ssbd: Add global mitigation state accessor
arm64: Add 'ssbd' command-line option
arm64: Add ARCH_WORKAROUND_2 probing
arm64: Add per-cpu infrastructure to call ARCH_WORKAROUND_2
arm64: Call ARCH_WORKAROUND_2 on transitions between EL0 and EL1
...
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- v9fs updates
- MM
- procfs updates
- lib/ updates
- autofs updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
autofs: small cleanup in autofs_getpath()
autofs: clean up includes
autofs: comment on selinux changes needed for module autoload
autofs: update MAINTAINERS entry for autofs
autofs: use autofs instead of autofs4 in documentation
autofs: rename autofs documentation files
autofs: create autofs Kconfig and Makefile
autofs: delete fs/autofs4 source files
autofs: update fs/autofs4/Makefile
autofs: update fs/autofs4/Kconfig
autofs: copy autofs4 to autofs
autofs4: use autofs instead of autofs4 everywhere
autofs4: merge auto_fs.h and auto_fs4.h
fs/binfmt_misc.c: do not allow offset overflow
checkpatch: improve patch recognition
lib/ucs2_string.c: add MODULE_LICENSE()
lib/mpi: headers cleanup
lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock
lib/idr.c: remove simple_ida_lock
lib/bitmap.c: micro-optimization for __bitmap_complement()
...
Currently an attempt to set swap.max into a value lower than the actual
swap usage fails, which causes configuration problems as there's no way
of lowering the configuration below the current usage short of turning
off swap entirely. This makes swap.max difficult to use and allows
delegatees to lock the delegator out of reducing swap allocation.
This patch updates swap_max_write() so that the limit can be lowered
below the current usage. It doesn't implement active reclaiming of swap
entries for the following reasons.
* mem_cgroup_swap_full() already tells the swap machinary to
aggressively reclaim swap entries if the usage is above 50% of
limit, so simply lowering the limit automatically triggers gradual
reclaim.
* Forcing back swapped out pages is likely to heavily impact the
workload and mess up the working set. Given that swap usually is a
lot less valuable and less scarce, letting the existing usage
dissipate over time through the above gradual reclaim and as they're
falted back in is likely the better behavior.
Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and allows
protecting working sets of important workloads from sudden reclaim.
But its semantics has a significant limitation: it works only as long as
there is a supply of reclaimable memory. This makes it pretty useless
against any sort of slow memory leaks or memory usage increases. This
is especially true for swapless systems. If swap is enabled, memory
soft protection effectively postpones problems, allowing a leaking
application to fill all swap area, which makes no sense. The only
effective way to guarantee the memory protection in this case is to
invoke the OOM killer.
It's possible to handle this case in userspace by reacting on MEMCG_LOW
events; but there is still a place for a fail-safe in-kernel mechanism
to provide stronger guarantees.
This patch introduces the memory.min interface for cgroup v2 memory
controller. It works very similarly to memory.low (sharing the same
hierarchical behavior), except that it's not disabled if there is no
more reclaimable memory in the system.
If cgroup is not populated, its memory.min is ignored, because otherwise
even the OOM killer wouldn't be able to reclaim the protected memory,
and the system can stall.
[guro@fb.com: s/low/min/ in docs]
Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add swap max and fail events so that userland can monitor and respond to
running out of swap.
I'm not too sure about the fail event. Right now, it's a bit confusing
which stats / events are recursive and which aren't and also which ones
reflect events which originate from a given cgroup and which targets the
cgroup. No idea what the right long term solution is and it could just
be that growing them organically is actually the only right thing to do.
Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>