mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2025-01-19 06:16:36 +07:00
Merge tag 'docs-4.18' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "There's been a fair amount of work in the docs tree this time around,
  including:

   - Extensive RST conversions and organizational work in the
     memory-management docs thanks to Mike Rapoport.

   - An update of Documentation/features from Andrea Parri and a script
     to keep it updated.

   - Various LICENSES updates from Thomas, along with a script to check
     SPDX tags.

   - Work to fix dangling references to documentation files; this
     involved a fair number of one-liner comment changes outside of
     Documentation/

  ... and the usual list of documentation improvements, typo fixes, etc"

* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
  Documentation: document hung_task_panic kernel parameter
  docs/admin-guide/mm: add high level concepts overview
  docs/vm: move ksm and transhuge from "user" to "internals" section.
  docs: Use the kerneldoc comments for memalloc_no*()
  doc: document scope NOFS, NOIO APIs
  docs: update kernel versions and dates in tables
  docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
  docs/vm: transhuge: minor updates
  docs/vm: transhuge: change sections order
  Documentation: arm: clean up Marvell Berlin family info
  Documentation: gpio: driver: Fix a typo and some odd grammar
  docs: ranoops.rst: fix location of ramoops.txt
  scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
  docs: uio-howto.rst: use a code block to solve a warning
  mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
  w1: w1_io.c: fix a kernel-doc warning
  Documentation/process/posting: wrap text at 80 cols
  docs: admin-guide: add cgroup-v2 documentation
  Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
  Documentation: refcount-vs-atomic: Update reference to LKMM doc.
  ...
commit eeee3149aa
@@ -64,8 +64,6 @@ auxdisplay/
- misc. LCD driver documentation (cfag12864b, ks0108).
backlight/
- directory with info on controlling backlights in flat panel displays
bcache.txt
- Block-layer cache on fast SSDs to improve slow (raid) I/O performance.
block/
- info on the Block I/O (BIO) layer.
blockdev/
@@ -78,18 +76,10 @@ bus-devices/
- directory with info on TI GPMC (General Purpose Memory Controller)
bus-virt-phys-mapping.txt
- how to access I/O mapped memory from within device drivers.
cachetlb.txt
- describes the cache/TLB flushing interfaces Linux uses.
cdrom/
- directory with information on the CD-ROM drivers that Linux has.
cgroup-v1/
- cgroups v1 features, including cpusets and memory controller.
cgroup-v2.txt
- cgroups v2 features, including cpusets and memory controller.
circular-buffers.txt
- how to make use of the existing circular buffer infrastructure
clk.txt
- info on the common clock framework
cma/
- Continuous Memory Area (CMA) debugfs interface.
conf.py

@@ -90,4 +90,4 @@ Date: December 2009
Contact: Lee Schermerhorn <lee.schermerhorn@hp.com>
Description:
The node's huge page size control/query attributes.
See Documentation/vm/hugetlbpage.txt
See Documentation/admin-guide/mm/hugetlbpage.rst

@@ -12,4 +12,4 @@ Description:
free_hugepages
surplus_hugepages
resv_hugepages
See Documentation/vm/hugetlbpage.txt for details.
See Documentation/admin-guide/mm/hugetlbpage.rst for details.

@@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface
sleep_millisecs: how many milliseconds ksm should sleep between
scans.

See Documentation/vm/ksm.txt for more information.
See Documentation/vm/ksm.rst for more information.

What: /sys/kernel/mm/ksm/merge_across_nodes
Date: January 2013

@@ -37,7 +37,7 @@ Description:
The alloc_calls file is read-only and lists the kernel code
locations from which allocations for this cache were performed.
The alloc_calls file only contains information if debugging is
enabled for that cache (see Documentation/vm/slub.txt).
enabled for that cache (see Documentation/vm/slub.rst).

What: /sys/kernel/slab/cache/alloc_fastpath
Date: February 2008
@@ -219,7 +219,7 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
Description:
The free_calls file is read-only and lists the locations of
object frees if slab debugging is enabled (see
Documentation/vm/slub.txt).
Documentation/vm/slub.rst).

What: /sys/kernel/slab/cache/free_fastpath
Date: February 2008

@@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking.
:maxdepth: 1

initrd
cgroup-v2
serial-console
braille-console
parport
@@ -60,9 +61,11 @@ configure specific aspects of kernel behavior to your liking.
mono
java
ras
bcache
pm/index
thunderbolt
LSM/index
mm/index

.. only:: subproject and html

@@ -106,11 +106,11 @@
use by PCI
Format: <irq>,<irq>...

acpi_mask_gpe= [HW,ACPI]
acpi_mask_gpe= [HW,ACPI]
Due to the existence of _Lxx/_Exx, some GPEs triggered
by unsupported hardware/firmware features can result in
GPE floodings that cannot be automatically disabled by
the GPE dispatcher.
GPE floodings that cannot be automatically disabled by
the GPE dispatcher.
This facility can be used to prevent such uncontrolled
GPE floodings.
Format: <int>
@@ -472,10 +472,10 @@
for platform specific values (SB1, Loongson3 and
others).

ccw_timeout_log [S390]
ccw_timeout_log [S390]
See Documentation/s390/CommonIO for details.

cgroup_disable= [KNL] Disable a particular controller
cgroup_disable= [KNL] Disable a particular controller
Format: {name of the controller(s) to disable}
The effects of cgroup_disable=foo are:
- foo isn't auto-mounted if you mount all cgroups in
@@ -518,7 +518,7 @@
those clocks in any way. This parameter is useful for
debug and development, but should not be needed on a
platform with proper driver support. For more
information, see Documentation/clk.txt.
information, see Documentation/driver-api/clk.rst.

clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
[Deprecated]
@@ -641,8 +641,8 @@
hvc<n> Use the hypervisor console device <n>. This is for
both Xen and PowerPC hypervisors.

If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
console=brl,ttyS0
For now, only VisioBraille is supported.

@@ -662,7 +662,7 @@

consoleblank= [KNL] The console blank (screen saver) timeout in
seconds. A value of 0 disables the blank timer.
Defaults to 0.
Defaults to 0.

coredump_filter=
[KNL] Change the default value for
@@ -730,7 +730,7 @@
or memory reserved is below 4G.

cryptomgr.notests
[KNL] Disable crypto self-tests
[KNL] Disable crypto self-tests

cs89x0_dma= [HW,NET]
Format: <dma>
@@ -746,7 +746,7 @@
Format: <port#>,<type>
See also Documentation/input/devices/joystick-parport.rst

ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
time. See
Documentation/admin-guide/dynamic-debug-howto.rst for
details. Deprecated, see dyndbg.
@@ -833,7 +833,7 @@
causing system reset or hang due to sending
INIT from AP to BSP.

disable_ddw [PPC/PSERIES]
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this
to work around buggy firmware.

@@ -1188,7 +1188,7 @@
parameter will force ia64_sal_cache_flush to call
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.

forcepae [X86-32]
forcepae [X86-32]
Forcefully enable Physical Address Extension (PAE).
Many Pentium M systems disable PAE but may have a
functionally usable PAE implementation.
@@ -1247,7 +1247,7 @@

gamma= [HW,DRM]

gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
Format: off | on
default: on

@@ -1341,23 +1341,32 @@
x86-64 are 2M (when the CPU supports "pse") and 1G
(when the CPU supports the "pdpe1gb" cpuinfo flag).

hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
terminal devices. Valid values: 0..8
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
hung_task_panic=
[KNL] Should the hung task detector generate panics.
Format: <integer>

A nonzero value instructs the kernel to panic when a
hung task is detected. The default value is controlled
by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
option. The value selected by this boot parameter can
be changed later by the kernel.hung_task_panic sysctl.

hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
terminal devices. Valid values: 0..8
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
keep_bootcon [KNL]
Do not unregister boot console at start. This is only
useful for debugging when something happens in the window
between unregistering the boot console and initializing
the real console.

i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
Format:
<bus_id>,<clkrate>
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
Format:
<bus_id>,<clkrate>

i8042.debug [HW] Toggle i8042 debug mode
i8042.unmask_kbd_data
@@ -1386,7 +1395,7 @@
Default: only on s2r transitions on x86; most other
architectures force reset to be always executed
i8042.unlock [HW] Unlock (ignore) the keylock
i8042.kbdreset [HW] Reset device connected to KBD port
i8042.kbdreset [HW] Reset device connected to KBD port

i810= [HW,DRM]

@@ -1548,13 +1557,13 @@
programs exec'd, files mmap'd for exec, and all files
opened for read by uid=0.

ima_template= [IMA]
ima_template= [IMA]
Select one of defined IMA measurements template formats.
Formats: { "ima" | "ima-ng" | "ima-sig" }
Default: "ima-ng"

ima_template_fmt=
[IMA] Define a custom template format.
[IMA] Define a custom template format.
Format: { "field1|...|fieldN" }

ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
@@ -1597,7 +1606,7 @@
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
Format: <irq>

int_pln_enable [x86] Enable power limit notification interrupt
int_pln_enable [x86] Enable power limit notification interrupt

integrity_audit=[IMA]
Format: { "0" | "1" }
@@ -1650,39 +1659,39 @@
0 disables intel_idle and fall back on acpi_idle.
1 to 9 specify maximum depth of C-state.

intel_pstate= [X86]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
passive
Use intel_pstate as a scaling driver, but configure it
to work with generic cpufreq governors (instead of
enabling its internal governor). This mode cannot be
used along with the hardware-managed P-states (HWP)
feature.
force
Enable intel_pstate on systems that prohibit it by default
in favor of acpi-cpufreq. Forcing the intel_pstate driver
instead of acpi-cpufreq may disable platform features, such
as thermal controls and power capping, that rely on ACPI
P-States information being indicated to OSPM and therefore
should be used with caution. This option does not work with
processors that aren't supported by the intel_pstate driver
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
no_hwp
Do not enable hardware P state control (HWP)
if available.
hwp_only
Only load intel_pstate on systems which support
hardware P state control (HWP) if available.
support_acpi_ppc
Enforce ACPI _PPC performance limits. If the Fixed ACPI
Description Table, specifies preferred power management
profile as "Enterprise Server" or "Performance Server",
then this feature is turned on by default.
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
intel_pstate= [X86]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
passive
Use intel_pstate as a scaling driver, but configure it
to work with generic cpufreq governors (instead of
enabling its internal governor). This mode cannot be
used along with the hardware-managed P-states (HWP)
feature.
force
Enable intel_pstate on systems that prohibit it by default
in favor of acpi-cpufreq. Forcing the intel_pstate driver
instead of acpi-cpufreq may disable platform features, such
as thermal controls and power capping, that rely on ACPI
P-States information being indicated to OSPM and therefore
should be used with caution. This option does not work with
processors that aren't supported by the intel_pstate driver
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
no_hwp
Do not enable hardware P state control (HWP)
if available.
hwp_only
Only load intel_pstate on systems which support
hardware P state control (HWP) if available.
support_acpi_ppc
Enforce ACPI _PPC performance limits. If the Fixed ACPI
Description Table, specifies preferred power management
profile as "Enterprise Server" or "Performance Server",
then this feature is turned on by default.
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface

intremap= [X86-64, Intel-IOMMU]
on enable Interrupt Remapping (default)
@@ -2026,7 +2035,7 @@
* [no]ncqtrim: Turn off queued DSM TRIM.

* nohrst, nosrst, norst: suppress hard, soft
and both resets.
and both resets.

* rstonce: only attempt one reset during
hot-unplug link recovery
@@ -2214,7 +2223,7 @@
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.

memhp_default_state=online/offline
memhp_default_state=online/offline
[KNL] Set the initial state for the memory hotplug
onlining policy. If not specified, the default value is
set according to the
@@ -2764,7 +2773,7 @@
[X86,PV_OPS] Disable paravirtualized VMware scheduler
clock and use the default one.

no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
steal time is computed, but won't influence scheduler
behaviour

@@ -2825,7 +2834,7 @@
notsc [BUGS=X86-32] Disable Time Stamp Counter

nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
soft-lockup and NMI watchdog (hard-lockup).

nowb [ARM]

@@ -2845,7 +2854,7 @@
If the dependencies are under your control, you can
turn on cpu0_hotplug.

nps_mtm_hs_ctr= [KNL,ARC]
nps_mtm_hs_ctr= [KNL,ARC]
This parameter sets the maximum duration, in
cycles, each HW thread of the CTOP can run
without interruptions, before HW switches it.
@@ -2986,7 +2995,7 @@

pci=option[,option...] [PCI] various PCI subsystem options:
earlydump [X86] dump PCI config space before the kernel
changes anything
changes anything
off [X86] don't probe for the PCI bus
bios [X86-32] force use of PCI BIOS, don't access
the hardware directly. Use this if your machine
@@ -3074,7 +3083,7 @@
is enabled by default. If you need to use this,
please report a bug.
nocrs [X86] Ignore PCI host bridge windows from ACPI.
If you need to use this, please report a bug.
If you need to use this, please report a bug.
routeirq Do IRQ routing for all PCI devices.
This is normally done in pci_enable_device(),
so this option is a temporary workaround
@@ -3917,7 +3926,7 @@
cache (risks via metadata attacks are mostly
unchanged). Debug options disable merging on their
own.
For more information see Documentation/vm/slub.txt.
For more information see Documentation/vm/slub.rst.

slab_max_order= [MM, SLAB]
Determines the maximum allowed order for slabs.
@@ -3931,7 +3940,7 @@
slub_debug can create guard zones around objects and
may poison objects when not in use. Also tracks the
last alloc / free. For more information see
Documentation/vm/slub.txt.
Documentation/vm/slub.rst.

slub_memcg_sysfs= [MM, SLUB]
Determines whether to enable sysfs directories for
@@ -3945,7 +3954,7 @@
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
Documentation/vm/slub.txt.
Documentation/vm/slub.rst.

slub_min_objects= [MM, SLUB]
The minimum number of objects per slab. SLUB will
@@ -3954,12 +3963,12 @@
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
For more information see Documentation/vm/slub.txt.
For more information see Documentation/vm/slub.rst.

slub_min_order= [MM, SLUB]
Determines the minimum page order for slabs. Must be
lower than slub_max_order.
For more information see Documentation/vm/slub.txt.
For more information see Documentation/vm/slub.rst.

slub_nomerge [MM, SLUB]
Same with slab_nomerge. This is supported for legacy.
@@ -4357,7 +4366,8 @@
Format: [always|madvise|never]
Can be used to control the default behavior of the system
with respect to transparent hugepages.
See Documentation/vm/transhuge.txt for more details.
See Documentation/admin-guide/mm/transhuge.rst
for more details.

tsc= Disable clocksource stability checks for TSC.
Format: <string>
@@ -4435,7 +4445,7 @@

usbcore.initial_descriptor_timeout=
[USB] Specifies timeout for the initial 64-byte
USB_REQ_GET_DESCRIPTOR request in milliseconds
USB_REQ_GET_DESCRIPTOR request in milliseconds
(default 5000 = 5.0 seconds).

usbcore.nousb [USB] Disable the USB subsystem

Documentation/admin-guide/mm/concepts.rst (new file, 222 lines)
@@ -0,0 +1,222 @@
.. _mm_concepts:

=================
Concepts overview
=================

The memory management in Linux is a complex system that evolved over the
years and included more and more functionality to support a variety of
systems from MMU-less microcontrollers to supercomputers. The memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will be
eventually written. Yet, although some of the concepts are the same,
here we assume that an MMU is available and a CPU can translate a virtual
address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex and
to avoid this complexity a concept of virtual memory was developed.

The virtual memory abstracts the details of physical memory from the
application software, allows keeping only needed information in the
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at the kernel build time by setting an
appropriate kernel configuration option.

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy with the next bits of the virtual address as the index to
that level page table. The lowest bits in the virtual address define
the offset inside the actual page.

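As a rough illustration of the walk described above, the sketch below
splits a virtual address into per-level table indexes and a page
offset. The layout is an assumption chosen for the example (four
levels, 9 index bits per level, 4 KiB pages, as in x86-64 4-level
paging); other architectures and kernel configurations differ, and
``split_virtual_address`` is a hypothetical helper, not a kernel
interface.

```python
PAGE_SHIFT = 12      # assumed 4 KiB pages: the low 12 bits are the page offset
BITS_PER_LEVEL = 9   # assumed 512 entries per page table
LEVELS = 4           # assumed four-level hierarchy


def split_virtual_address(vaddr):
    """Return ([top-level index, ..., lowest-level index], page offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indexes = []
    # Walk from the highest bits (top level table) down to the lowest level.
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indexes.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indexes, offset


if __name__ == "__main__":
    # Build an address from known indexes so the split is easy to check.
    vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
    print(split_virtual_address(vaddr))
```

Each index selects one entry in the table for its level; only the
final offset addresses a byte inside the page itself.
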
Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called Translation Lookaside Buffer (or
TLB). Usually the TLB is a pretty scarce resource and applications with
a large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit-rate and thus improves overall system performance.

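The effect on TLB pressure can be estimated with simple arithmetic:
the amount of memory a TLB can cover is the number of entries times
the page size. The entry count below is a made-up figure used only for
illustration; real TLBs have several levels and often separate
capacities per page size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


if __name__ == "__main__":
    entries = 64  # hypothetical TLB size, not taken from any real CPU
    for name, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
        print(f"{name} pages: {tlb_reach(entries, size) // KIB} KiB covered")
```

With the same number of entries, moving from 4K to 2M pages multiplies
the covered memory by 512, which is why large working sets benefit.
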
There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and is mapped using huge pages. The hugetlbfs is described at
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.

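On a running system the current THP policy can be inspected through
sysfs. The sketch below reads
``/sys/kernel/mm/transparent_hugepage/enabled``, where the active
choice among ``always``, ``madvise`` and ``never`` is shown in square
brackets. The helper names are hypothetical, and the file exists only
on kernels built with THP support, so the reader falls back to
``None``.

```python
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text):
    """Return the bracketed word, e.g. 'madvise' from 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def current_thp_mode(path=THP_ENABLED):
    """Read the active THP mode, or None if the file is unavailable."""
    try:
        with open(path) as f:
            return parse_thp_mode(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_thp_mode())
```
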
Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for a program's stack and heap or by explicit calls to the mmap(2) system
call. Usually, the anonymous mappings only define virtual memory areas
that the program is allowed to access. The read accesses will result
in creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.

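From user space, an anonymous mapping can be created with an explicit
mmap call that has no backing file. The following sketch uses
Python's ``mmap`` module (passing ``-1`` as the file descriptor
requests anonymous memory); ``touch_anonymous_mapping`` is a
hypothetical name, and the code only shows the behaviour visible to
the program, not what the kernel does with page tables underneath.

```python
import mmap

PAGE = mmap.PAGESIZE


def touch_anonymous_mapping(pages=4):
    """Map anonymous memory, read it, then write to the first page."""
    with mmap.mmap(-1, pages * PAGE) as region:  # -1: no backing file
        before = bytes(region[:8])   # untouched anonymous memory reads as zeroes
        region[:5] = b"hello"        # a write stores data in the mapping
        after = bytes(region[:8])
    return before, after


if __name__ == "__main__":
    print(touch_anonymous_mapping())
```
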
Reclaim
|
||||
=======
|
||||
|
||||
Throughout the system lifetime, a physical page can be used for storing
|
||||
different types of data. It can be kernel internal data structures,
|
||||
DMA'able buffers for device drivers use, data read from a filesystem,
|
||||
memory allocated by user space processes etc.
|
||||
|
||||
Depending on the page usage it is treated differently by the Linux
|
||||
memory management. The pages that can be freed at any time, either
|
||||
because they cache the data available elsewhere, for instance, on a
|
||||
hard disk, or because they can be swapped out, again, to the hard
|
||||
disk, are called `reclaimable`. The most notable categories of the
|
||||
reclaimable pages are page cache and anonymous memory.
|
||||
|
||||
In most cases, the pages holding internal kernel data and used as DMA
|
||||
buffers cannot be repurposed, and they remain pinned until freed by
|
||||
their user. Such pages are called `unreclaimable`. However, in certain
|
||||
circumstances, even pages occupied with kernel data structures can be
|
||||
reclaimed. For instance, in-memory caches of filesystem metadata can
|
||||
be re-read from the storage device and therefore it is possible to
|
||||
discard them from the main memory when system is under memory
|
||||
pressure.
|
||||
|
||||
The process of freeing the reclaimable physical memory pages and
|
||||
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
|
||||
pages either asynchronously or synchronously, depending on the state
|
||||
of the system. When the system is not loaded, most of the memory is free
|
||||
and allocation requests will be satisfied immediately from the free
|
||||
pages supply. As the load increases, the amount of the free pages goes
|
||||
down and when it reaches a certain threshold (low watermark), an
|
||||
allocation request will awaken the ``kswapd`` daemon. It will
|
||||
asynchronously scan memory pages and either just free them if the data
|
||||
they contain is available elsewhere, or evict to the backing storage
|
||||
device (remember those dirty pages?). As memory usage increases even
|
||||
more and reaches another threshold - min watermark - an allocation
|
||||
will trigger `direct reclaim`. In this case, the allocation is stalled
|
||||
until enough memory pages are reclaimed to satisfy the request.
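The two thresholds described above can be pictured with a toy model. This is only a sketch: the function name, the ``wake_kswapd_mark`` and ``min_mark`` parameters and the page counts are hypothetical, and the real allocator works per zone and takes allocation flags and other state into account.

```python
def allocation_path(free_pages, wake_kswapd_mark, min_mark):
    """Classify an allocation in a toy model of the reclaim thresholds.

    free_pages       -- number of currently free pages (hypothetical input)
    wake_kswapd_mark -- threshold below which kswapd is woken
    min_mark         -- threshold below which the caller reclaims directly
    """
    if free_pages <= min_mark:
        # The allocation stalls until enough pages have been reclaimed.
        return "direct reclaim"
    if free_pages <= wake_kswapd_mark:
        # The allocation proceeds while kswapd reclaims in the background.
        return "wake kswapd"
    # Plenty of free memory: satisfied immediately from the free pages.
    return "fast path"
```

For example, with ``wake_kswapd_mark=2000`` and ``min_mark=500``, a request made while 1500 pages are free would fall into the "wake kswapd" case of this model.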
|
||||
|
||||
Compaction
|
||||
==========
|
||||
|
||||
As the system runs, tasks allocate and free the memory and it becomes
|
||||
fragmented. Although with virtual memory it is possible to present
|
||||
scattered physical pages as a virtually contiguous range, sometimes it is
|
||||
necessary to allocate large physically contiguous memory areas. Such
|
||||
need may arise, for instance, when a device driver requires a large
|
||||
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
|
||||
addresses the fragmentation issue. This mechanism moves occupied pages
|
||||
from the lower part of a memory zone to free pages in the upper part
|
||||
of the zone. When a compaction scan is finished, free pages are grouped
|
||||
together at the beginning of the zone and allocations of large
|
||||
physically contiguous areas become possible.
|
||||
|
||||
Like reclaim, compaction may happen asynchronously in the ``kcompactd``
|
||||
daemon or synchronously as a result of a memory allocation request.
|
||||
|
||||
OOM killer
|
||||
==========
|
||||
|
||||
It is possible that on a loaded machine memory will be exhausted. When
|
||||
the kernel detects that the system runs out of memory (OOM) it invokes
|
||||
the `OOM killer`. Its mission is simple: all it has to do is select a
|
||||
task to sacrifice for the sake of the overall system health. The
|
||||
selected task is killed in the hope that after it exits enough memory
|
||||
will be freed to continue normal operation.
|
@ -1,3 +1,11 @@
|
||||
.. _hugetlbpage:
|
||||
|
||||
=============
|
||||
HugeTLB Pages
|
||||
=============
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
The intent of this file is to give a brief summary of hugetlbpage support in
|
||||
the Linux kernel. This support is built on top of multiple page size support
|
||||
@ -18,53 +26,59 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
|
||||
automatically when CONFIG_HUGETLBFS is selected) configuration
|
||||
options.
|
||||
|
||||
The /proc/meminfo file provides information about the total number of
|
||||
The ``/proc/meminfo`` file provides information about the total number of
|
||||
persistent hugetlb pages in the kernel's huge page pool. It also displays
|
||||
default huge page size and information about the number of free, reserved
|
||||
and surplus huge pages in the pool of huge pages of default size.
|
||||
The huge page size is needed for generating the proper alignment and
|
||||
size of the arguments to system calls that map huge page regions.
|
||||
|
||||
The output of "cat /proc/meminfo" will include lines like:
|
||||
The output of ``cat /proc/meminfo`` will include lines like::
|
||||
|
||||
.....
|
||||
HugePages_Total: uuu
|
||||
HugePages_Free: vvv
|
||||
HugePages_Rsvd: www
|
||||
HugePages_Surp: xxx
|
||||
Hugepagesize: yyy kB
|
||||
Hugetlb: zzz kB
|
||||
HugePages_Total: uuu
|
||||
HugePages_Free: vvv
|
||||
HugePages_Rsvd: www
|
||||
HugePages_Surp: xxx
|
||||
Hugepagesize: yyy kB
|
||||
Hugetlb: zzz kB
|
||||
|
||||
where:
|
||||
HugePages_Total is the size of the pool of huge pages.
|
||||
HugePages_Free is the number of huge pages in the pool that are not yet
|
||||
allocated.
|
||||
HugePages_Rsvd is short for "reserved," and is the number of huge pages for
|
||||
which a commitment to allocate from the pool has been made,
|
||||
but no allocation has yet been made. Reserved huge pages
|
||||
guarantee that an application will be able to allocate a
|
||||
huge page from the pool of huge pages at fault time.
|
||||
HugePages_Surp is short for "surplus," and is the number of huge pages in
|
||||
the pool above the value in /proc/sys/vm/nr_hugepages. The
|
||||
maximum number of surplus huge pages is controlled by
|
||||
/proc/sys/vm/nr_overcommit_hugepages.
|
||||
Hugepagesize is the default hugepage size (in Kb).
|
||||
Hugetlb is the total amount of memory (in kB), consumed by huge
|
||||
pages of all sizes.
|
||||
If huge pages of different sizes are in use, this number
|
||||
will exceed HugePages_Total * Hugepagesize. To get more
|
||||
detailed information, please, refer to
|
||||
/sys/kernel/mm/hugepages (described below).
|
||||
|
||||
HugePages_Total
|
||||
is the size of the pool of huge pages.
|
||||
HugePages_Free
|
||||
is the number of huge pages in the pool that are not yet
|
||||
allocated.
|
||||
HugePages_Rsvd
|
||||
is short for "reserved," and is the number of huge pages for
|
||||
which a commitment to allocate from the pool has been made,
|
||||
but no allocation has yet been made. Reserved huge pages
|
||||
guarantee that an application will be able to allocate a
|
||||
huge page from the pool of huge pages at fault time.
|
||||
HugePages_Surp
|
||||
is short for "surplus," and is the number of huge pages in
|
||||
the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
|
||||
maximum number of surplus huge pages is controlled by
|
||||
``/proc/sys/vm/nr_overcommit_hugepages``.
|
||||
Hugepagesize
|
||||
is the default hugepage size (in kB).
|
||||
Hugetlb
|
||||
is the total amount of memory (in kB), consumed by huge
|
||||
pages of all sizes.
|
||||
If huge pages of different sizes are in use, this number
|
||||
will exceed HugePages_Total \* Hugepagesize. To get more
|
||||
detailed information, please refer to
|
||||
``/sys/kernel/mm/hugepages`` (described below).
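The fields described above are plain ``name: value`` lines, so they are easy to extract in a script. The following is a minimal sketch; the helper names are hypothetical and the sample values are made up for illustration. On a real system the text would come from reading ``/proc/meminfo``.

```python
def parse_hugepage_info(meminfo_text):
    """Return the HugePages_*, Hugepagesize and Hugetlb fields as integers."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        key = key.strip()
        if not sep or not key.startswith(("HugePages_", "Hugepagesize", "Hugetlb")):
            continue
        # Values look like "20" or "2048 kB"; keep only the number.
        fields[key] = int(rest.split()[0])
    return fields


def default_pool_kb(fields):
    """Memory held by the pool of default-size huge pages, in kB."""
    return fields["HugePages_Total"] * fields["Hugepagesize"]


# Made-up sample in the format shown above.
sample = """\
MemTotal:       16384000 kB
HugePages_Total:      20
HugePages_Free:       18
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:           40960 kB
"""
info = parse_hugepage_info(sample)
```

Comparing ``default_pool_kb(info)`` with ``info["Hugetlb"]`` shows whether huge pages of a non-default size are also in use: in that case ``Hugetlb`` is the larger of the two.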
|
||||
|
||||
|
||||
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
|
||||
in the kernel.
|
||||
``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
|
||||
configured in the kernel.
|
||||
|
||||
/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
|
||||
``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
|
||||
pages in the kernel's huge page pool. "Persistent" huge pages will be
|
||||
returned to the huge page pool when freed by a task. A user with root
|
||||
privileges can dynamically allocate more or free some persistent huge pages
|
||||
by increasing or decreasing the value of 'nr_hugepages'.
|
||||
by increasing or decreasing the value of ``nr_hugepages``.
|
||||
|
||||
Pages that are used as huge pages are reserved inside the kernel and cannot
|
||||
be used for other purposes. Huge pages cannot be swapped out under
|
||||
@ -73,7 +87,7 @@ memory pressure.
|
||||
Once a number of huge pages have been pre-allocated to the kernel huge page
|
||||
pool, a user with appropriate privilege can use either the mmap system call
|
||||
or shared memory system calls to use the huge pages. See the discussion of
|
||||
Using Huge Pages, below.
|
||||
:ref:`Using Huge Pages <using_huge_pages>`, below.
|
||||
|
||||
The administrator can allocate persistent huge pages on the kernel boot
|
||||
command line by specifying the "hugepages=N" parameter, where 'N' = the
|
||||
@ -86,10 +100,10 @@ with a huge page size selection parameter "hugepagesz=<size>". <size> must
|
||||
be specified in bytes with optional scale suffix [kKmMgG]. The default huge
|
||||
page size may be selected with the "default_hugepagesz=<size>" boot parameter.
|
||||
|
||||
When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
|
||||
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
|
||||
indicates the current number of pre-allocated huge pages of the default size.
|
||||
Thus, one can use the following command to dynamically allocate/deallocate
|
||||
default sized persistent huge pages:
|
||||
default sized persistent huge pages::
|
||||
|
||||
echo 20 > /proc/sys/vm/nr_hugepages
|
||||
|
||||
@ -98,11 +112,12 @@ huge page pool to 20, allocating or freeing huge pages, as required.
|
||||
|
||||
On a NUMA platform, the kernel will attempt to distribute the huge page pool
|
||||
over all the set of allowed nodes specified by the NUMA memory policy of the
|
||||
task that modifies nr_hugepages. The default for the allowed nodes--when the
|
||||
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
|
||||
task has default memory policy--is all on-line nodes with memory. Allowed
|
||||
nodes with insufficient available, contiguous memory for a huge page will be
|
||||
silently skipped when allocating persistent huge pages. See the discussion
|
||||
below of the interaction of task memory policy, cpusets and per node attributes
|
||||
silently skipped when allocating persistent huge pages. See the
|
||||
:ref:`discussion below <mem_policy_and_hp_alloc>`
|
||||
of the interaction of task memory policy, cpusets and per node attributes
|
||||
with the allocation and freeing of persistent huge pages.
|
||||
|
||||
The success or failure of huge page allocation depends on the amount of
|
||||
@ -117,51 +132,52 @@ init files. This will enable the kernel to allocate huge pages early in
|
||||
the boot process when the possibility of getting physical contiguous pages
|
||||
is still very high. Administrators can verify the number of huge pages
|
||||
actually allocated by checking the sysctl or meminfo. To check the per node
|
||||
distribution of huge pages in a NUMA system, use:
|
||||
distribution of huge pages in a NUMA system, use::
|
||||
|
||||
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
|
||||
|
||||
/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
|
||||
huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
|
||||
``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
|
||||
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
|
||||
requested by applications. Writing any non-zero value into this file
|
||||
indicates that the hugetlb subsystem is allowed to try to obtain that
|
||||
number of "surplus" huge pages from the kernel's normal page pool, when the
|
||||
persistent huge page pool is exhausted. As these surplus huge pages become
|
||||
unused, they are freed back to the kernel's normal page pool.
|
||||
|
||||
When increasing the huge page pool size via nr_hugepages, any existing surplus
|
||||
pages will first be promoted to persistent huge pages. Then, additional
|
||||
When increasing the huge page pool size via ``nr_hugepages``, any existing
|
||||
surplus pages will first be promoted to persistent huge pages. Then, additional
|
||||
huge pages will be allocated, if necessary and if possible, to fulfill
|
||||
the new persistent huge page pool size.
|
||||
|
||||
The administrator may shrink the pool of persistent huge pages for
|
||||
the default huge page size by setting the nr_hugepages sysctl to a
|
||||
the default huge page size by setting the ``nr_hugepages`` sysctl to a
|
||||
smaller value. The kernel will attempt to balance the freeing of huge pages
|
||||
across all nodes in the memory policy of the task modifying nr_hugepages.
|
||||
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
|
||||
Any free huge pages on the selected nodes will be freed back to the kernel's
|
||||
normal page pool.
|
||||
|
||||
Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
|
||||
Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
|
||||
it becomes less than the number of huge pages in use will convert the balance
|
||||
of the in-use huge pages to surplus huge pages. This will occur even if
|
||||
the number of surplus pages it would exceed the overcommit value. As long as
|
||||
this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
|
||||
the number of surplus pages would exceed the overcommit value. As long as
|
||||
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
|
||||
increased sufficiently, or the surplus huge pages go out of use and are freed--
|
||||
no more surplus huge pages will be allowed to be allocated.
|
||||
|
||||
With support for multiple huge page pools at run-time available, much of
|
||||
the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
|
||||
The /proc interfaces discussed above have been retained for backwards
|
||||
compatibility. The root huge page control directory in sysfs is:
|
||||
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
|
||||
sysfs.
|
||||
The ``/proc`` interfaces discussed above have been retained for backwards
|
||||
compatibility. The root huge page control directory in sysfs is::
|
||||
|
||||
/sys/kernel/mm/hugepages
|
||||
|
||||
For each huge page size supported by the running kernel, a subdirectory
|
||||
will exist, of the form:
|
||||
will exist, of the form::
|
||||
|
||||
hugepages-${size}kB
|
||||
|
||||
Inside each of these directories, the same set of files will exist:
|
||||
Inside each of these directories, the same set of files will exist::
|
||||
|
||||
nr_hugepages
|
||||
nr_hugepages_mempolicy
|
||||
@ -172,37 +188,39 @@ Inside each of these directories, the same set of files will exist:
|
||||
|
||||
which function as described above for the default huge page-sized case.
|
||||
|
||||
.. _mem_policy_and_hp_alloc:
|
||||
|
||||
Interaction of Task Memory Policy with Huge Page Allocation/Freeing
|
||||
===================================================================
|
||||
|
||||
Whether huge pages are allocated and freed via the /proc interface or
|
||||
the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
|
||||
nodes from which huge pages are allocated or freed are controlled by the
|
||||
NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
|
||||
sysctl or attribute. When the nr_hugepages attribute is used, mempolicy
|
||||
Whether huge pages are allocated and freed via the ``/proc`` interface or
|
||||
the sysfs interface using the ``nr_hugepages_mempolicy`` attribute, the
|
||||
NUMA nodes from which huge pages are allocated or freed are controlled by the
|
||||
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
|
||||
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
|
||||
is ignored.
|
||||
|
||||
The recommended method to allocate or free huge pages to/from the kernel
|
||||
huge page pool, using the nr_hugepages example above, is:
|
||||
huge page pool, using the ``nr_hugepages`` example above, is::
|
||||
|
||||
numactl --interleave <node-list> echo 20 \
|
||||
>/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
or, more succinctly:
|
||||
or, more succinctly::
|
||||
|
||||
numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
|
||||
|
||||
This will allocate or free abs(20 - nr_hugepages) to or from the nodes
|
||||
This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
|
||||
specified in <node-list>, depending on whether the number of persistent huge pages
|
||||
is initially less than or greater than 20, respectively. No huge pages will be
|
||||
allocated nor freed on any node not included in the specified <node-list>.
|
||||
|
||||
When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
|
||||
When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
|
||||
memory policy mode--bind, preferred, local or interleave--may be used. The
|
||||
resulting effect on persistent huge page allocation is as follows:
|
||||
|
||||
1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
|
||||
#. Regardless of mempolicy mode [see
|
||||
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
|
||||
persistent huge pages will be distributed across the node or nodes
|
||||
specified in the mempolicy as if "interleave" had been specified.
|
||||
However, if a node in the policy does not contain sufficient contiguous
|
||||
@ -212,7 +230,7 @@ resulting effect on persistent huge page allocation is as follows:
|
||||
possibly, allocation of persistent huge pages on nodes not allowed by
|
||||
the task's memory policy.
|
||||
|
||||
2) One or more nodes may be specified with the bind or interleave policy.
|
||||
#. One or more nodes may be specified with the bind or interleave policy.
|
||||
If more than one node is specified with the preferred policy, only the
|
||||
lowest numeric id will be used. Local policy will select the node where
|
||||
the task is running at the time the nodes_allowed mask is constructed.
|
||||
@ -222,20 +240,20 @@ resulting effect on persistent huge page allocation is as follows:
|
||||
indeterminate. Thus, local policy is not very useful for this purpose.
|
||||
Any of the other mempolicy modes may be used to specify a single node.
|
||||
|
||||
3) The nodes allowed mask will be derived from any non-default task mempolicy,
|
||||
#. The nodes allowed mask will be derived from any non-default task mempolicy,
|
||||
whether this policy was set explicitly by the task itself or one of its
|
||||
ancestors, such as numactl. This means that if the task is invoked from a
|
||||
shell with non-default policy, that policy will be used. One can specify a
|
||||
node list of "all" with numactl --interleave or --membind [-m] to achieve
|
||||
interleaving over all nodes in the system or cpuset.
|
||||
|
||||
4) Any task mempolicy specified--e.g., using numactl--will be constrained by
|
||||
#. Any task mempolicy specified--e.g., using numactl--will be constrained by
|
||||
the resource limits of any cpuset in which the task runs. Thus, there will
|
||||
be no way for a task with non-default policy running in a cpuset with a
|
||||
subset of the system nodes to allocate huge pages outside the cpuset
|
||||
without first moving to a cpuset that contains all of the desired nodes.
|
||||
|
||||
5) Boot-time huge page allocation attempts to distribute the requested number
|
||||
#. Boot-time huge page allocation attempts to distribute the requested number
|
||||
of huge pages over all on-line nodes with memory.
|
||||
|
||||
Per Node Hugepages Attributes
|
||||
@ -243,22 +261,22 @@ Per Node Hugepages Attributes
|
||||
|
||||
A subset of the contents of the root huge page control directory in sysfs,
|
||||
described above, will be replicated under the system device of each
|
||||
NUMA node with memory in:
|
||||
NUMA node with memory in::
|
||||
|
||||
/sys/devices/system/node/node[0-9]*/hugepages/
|
||||
|
||||
Under this directory, the subdirectory for each supported huge page size
|
||||
contains the following attribute files:
|
||||
contains the following attribute files::
|
||||
|
||||
nr_hugepages
|
||||
free_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
The free_' and surplus_' attribute files are read-only. They return the number
|
||||
The ``free_hugepages`` and ``surplus_hugepages`` attribute files are read-only. They return the number
|
||||
of free and surplus [overcommitted] huge pages, respectively, on the parent
|
||||
node.
|
||||
|
||||
The nr_hugepages attribute returns the total number of huge pages on the
|
||||
The ``nr_hugepages`` attribute returns the total number of huge pages on the
|
||||
specified node. When this attribute is written, the number of persistent huge
|
||||
pages on the parent node will be adjusted to the specified value, if sufficient
|
||||
resources exist, regardless of the task's mempolicy or cpuset constraints.
|
||||
@ -267,43 +285,58 @@ Note that the number of overcommit and reserve pages remain global quantities,
|
||||
as we don't know until fault time, when the faulting task's mempolicy is
|
||||
applied, from which node the huge page allocation will be attempted.
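The per-node attribute files described above can be collected with a few lines of script. This is a sketch under the assumption that the directory layout matches the description in this section; the function name and the ``root`` parameter are hypothetical, the latter existing mainly so that the code can be pointed at a test directory.

```python
import glob
import os


def per_node_hugepages(root="/sys/devices/system/node"):
    """Map (node, size directory) to the per-node huge page counters."""
    result = {}
    pattern = os.path.join(root, "node[0-9]*", "hugepages", "hugepages-*")
    for size_dir in sorted(glob.glob(pattern)):
        # size_dir looks like <root>/node0/hugepages/hugepages-2048kB
        node = os.path.basename(os.path.dirname(os.path.dirname(size_dir)))
        counters = {}
        for name in ("nr_hugepages", "free_hugepages", "surplus_hugepages"):
            with open(os.path.join(size_dir, name)) as f:
                counters[name] = int(f.read())
        result[(node, os.path.basename(size_dir))] = counters
    return result
```

Only reading is shown here; writing to the per-node ``nr_hugepages`` file changes the pool and requires appropriate privileges.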
|
||||
|
||||
.. _using_huge_pages:
|
||||
|
||||
Using Huge Pages
|
||||
================
|
||||
|
||||
If the user applications are going to request huge pages using the mmap system
|
||||
call, then the system administrator must mount a file system of
|
||||
type hugetlbfs:
|
||||
type hugetlbfs::
|
||||
|
||||
mount -t hugetlbfs \
|
||||
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
|
||||
min_size=<value>,nr_inodes=<value> none /mnt/huge
|
||||
|
||||
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
|
||||
/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid
|
||||
options sets the owner and group of the root of the file system. By default
|
||||
the uid and gid of the current process are taken. The mode option sets the
|
||||
mode of root of file system to value & 01777. This value is given in octal.
|
||||
By default the value 0755 is picked. If the platform supports multiple huge
|
||||
page sizes, the pagesize option can be used to specify the huge page size and
|
||||
associated pool. pagesize is specified in bytes. If pagesize is not specified
|
||||
the platform's default huge page size and associated pool will be used. The
|
||||
size option sets the maximum value of memory (huge pages) allowed for that
|
||||
filesystem (/mnt/huge). The size option can be specified in bytes, or as a
|
||||
percentage of the specified huge page pool (nr_hugepages). The size is
|
||||
rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum
|
||||
value of memory (huge pages) allowed for the filesystem. min_size can be
|
||||
specified in the same way as size, either bytes or a percentage of the
|
||||
huge page pool. At mount time, the number of huge pages specified by
|
||||
min_size are reserved for use by the filesystem. If there are not enough
|
||||
free huge pages available, the mount will fail. As huge pages are allocated
|
||||
to the filesystem and freed, the reserve count is adjusted so that the sum
|
||||
of allocated and reserved huge pages is always at least min_size. The option
|
||||
nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the
|
||||
size, min_size or nr_inodes option is not provided on command line then
|
||||
no limits are set. For pagesize, size, min_size and nr_inodes options, you
|
||||
can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K
|
||||
has the same meaning as size=2048.
|
||||
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.
|
||||
|
||||
The ``uid`` and ``gid`` options set the owner and group of the root of the
|
||||
file system. By default the ``uid`` and ``gid`` of the current process
|
||||
are taken.
|
||||
|
||||
The ``mode`` option sets the mode of the root of the file system to value & 01777.
|
||||
This value is given in octal. By default the value 0755 is picked.
|
||||
|
||||
If the platform supports multiple huge page sizes, the ``pagesize`` option can
|
||||
be used to specify the huge page size and associated pool. ``pagesize``
|
||||
is specified in bytes. If ``pagesize`` is not specified, the platform's
|
||||
default huge page size and associated pool will be used.
|
||||
|
||||
The ``size`` option sets the maximum value of memory (huge pages) allowed
|
||||
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
|
||||
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
|
||||
The size is rounded down to the HPAGE_SIZE boundary.
|
||||
|
||||
The ``min_size`` option sets the minimum value of memory (huge pages) allowed
|
||||
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
|
||||
either bytes or a percentage of the huge page pool.
|
||||
At mount time, the number of huge pages specified by ``min_size`` are reserved
|
||||
for use by the filesystem.
|
||||
If there are not enough free huge pages available, the mount will fail.
|
||||
As huge pages are allocated to the filesystem and freed, the reserve count
|
||||
is adjusted so that the sum of allocated and reserved huge pages is always
|
||||
at least ``min_size``.
|
||||
|
||||
The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
|
||||
can use.
|
||||
|
||||
If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
|
||||
the command line, then no limits are set.
|
||||
|
||||
For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
|
||||
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
|
||||
For example, size=2K has the same meaning as size=2048.
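The suffix convention can be illustrated with a small conversion helper. This is a sketch only: ``parse_size`` is a hypothetical name, binary multiples are assumed because ``size=2K`` is stated to equal ``size=2048``, and the percentage form accepted by ``size`` and ``min_size`` is not handled.

```python
def parse_size(value):
    """Convert a string such as "2K", "512m" or "1G" to a plain number."""
    multipliers = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    suffix = value[-1:].lower()
    if suffix in multipliers:
        # Strip the suffix and scale the remaining digits.
        return int(value[:-1]) * multipliers[suffix]
    # No suffix: the value is already a plain number.
    return int(value)
```

With this helper, ``parse_size("2K")`` and ``parse_size("2048")`` give the same result.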
|
||||
|
||||
While read system calls are supported on files that reside on hugetlb
|
||||
file systems, write system calls are not.
|
||||
@ -313,12 +346,12 @@ used to change the file attributes on hugetlbfs.
|
||||
|
||||
Also, it is important to note that no such mount command is required if
|
||||
applications are going to use only shmat/shmget system calls or mmap with
|
||||
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb
|
||||
below.
|
||||
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
|
||||
:ref:`map_hugetlb <map_hugetlb>` below.
|
||||
|
||||
Users who wish to use hugetlb memory via shared memory segment should be a
|
||||
member of a supplementary group and system admin needs to configure that gid
|
||||
into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different
|
||||
Users who wish to use hugetlb memory via shared memory segment should be
|
||||
members of a supplementary group and the system administrator needs to configure that gid
|
||||
into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for the same or different
|
||||
applications to use any combination of mmaps and shm* calls, though the mount of
|
||||
filesystem will be required for using mmap calls without MAP_HUGETLB.
|
||||
|
||||
@ -332,20 +365,18 @@ a hugetlb page and the length is smaller than the hugepage size.
|
||||
Examples
|
||||
========
|
||||
|
||||
1) map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c
|
||||
.. _map_hugetlb:
|
||||
|
||||
2) hugepage-shm: see tools/testing/selftests/vm/hugepage-shm.c
|
||||
``map_hugetlb``
|
||||
see tools/testing/selftests/vm/map_hugetlb.c
|
||||
|
||||
3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c
|
||||
``hugepage-shm``
|
||||
see tools/testing/selftests/vm/hugepage-shm.c
|
||||
|
||||
4) The libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs) library
|
||||
provides a wide range of userspace tools to help with huge page usability,
|
||||
environment setup, and control.
|
||||
``hugepage-mmap``
|
||||
see tools/testing/selftests/vm/hugepage-mmap.c
|
||||
|
||||
Kernel development regression testing
|
||||
=====================================
|
||||
The `libhugetlbfs`_ library provides a wide range of userspace tools
|
||||
to help with huge page usability, environment setup, and control.
|
||||
|
||||
The most complete set of hugetlb tests are in the libhugetlbfs repository.
|
||||
If you modify any hugetlb related code, use the libhugetlbfs test suite
|
||||
to check for regressions. In addition, if you add any new hugetlb
|
||||
functionality, please add appropriate tests to libhugetlbfs.
|
||||
.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
|
@ -1,4 +1,11 @@
|
||||
MOTIVATION
|
||||
.. _idle_page_tracking:
|
||||
|
||||
==================
|
||||
Idle Page Tracking
|
||||
==================
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
The idle page tracking feature makes it possible to track which memory pages are being
|
||||
accessed by a workload and which are idle. This information can be useful for
|
||||
@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster.
|
||||
|
||||
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
|
||||
|
||||
USER API
|
||||
.. _user_api:
|
||||
|
||||
The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently,
|
||||
it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap.
|
||||
User API
|
||||
========
|
||||
|
||||
The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
|
||||
Currently, it consists of a single read-write file,
|
||||
``/sys/kernel/mm/page_idle/bitmap``.
|
||||
|
||||
The file implements a bitmap where each bit corresponds to a memory page. The
|
||||
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
|
||||
@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
|
||||
set, the corresponding page is idle.
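The mapping from a PFN to its place in the bitmap can be written out explicitly. The helper names below are hypothetical; the arithmetic follows the layout described above (8-byte array elements, bit ``i % 64`` of element ``i / 64``).

```python
def bitmap_position(pfn):
    """Return (byte offset of the 8-byte element, bit index) for a PFN."""
    word_index = pfn // 64
    # Each array element is 8 bytes, so the file offset is word_index * 8.
    return word_index * 8, pfn % 64


def is_idle(word, pfn):
    """Test the bit for `pfn` in `word`, the 8-byte element that covers it."""
    return bool((word >> (pfn % 64)) & 1)
```

For instance, the page at PFN 130 is described by bit 2 of the element at byte offset 16.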
|
||||
|
||||
A page is considered idle if it has not been accessed since it was marked idle
|
||||
(for more details on what "accessed" actually means see the IMPLEMENTATION
|
||||
DETAILS section). To mark a page idle one has to set the bit corresponding to
|
||||
(for more details on what "accessed" actually means see the :ref:`Implementation
|
||||
Details <impl_details>` section).
|
||||
To mark a page idle one has to set the bit corresponding to
|
||||
the page by writing to the file. A value written to the file is OR-ed with the
|
||||
current bitmap value.
|
||||
|
||||
@ -30,9 +42,9 @@ page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
|
||||
and hence such pages are never reported idle.
|
||||
|
||||
For huge pages the idle flag is set only on the head page, so one has to read
|
||||
/proc/kpageflags in order to correctly count idle huge pages.
|
||||
``/proc/kpageflags`` in order to correctly count idle huge pages.
|
||||
|
||||
Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return
|
||||
Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
|
||||
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
|
||||
if the size of the read/write is not a multiple of 8 bytes. Writing to
|
||||
this file beyond max PFN will return -ENXIO.
|
||||
@ -41,21 +53,26 @@ That said, in order to estimate the amount of pages that are not used by a
|
||||
workload one should:
|
||||
|
||||
1. Mark all the workload's pages as idle by setting corresponding bits in
|
||||
/sys/kernel/mm/page_idle/bitmap. The pages can be found by reading
|
||||
/proc/pid/pagemap if the workload is represented by a process, or by
|
||||
filtering out alien pages using /proc/kpagecgroup in case the workload is
|
||||
placed in a memory cgroup.
|
||||
``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
|
||||
``/proc/pid/pagemap`` if the workload is represented by a process, or by
|
||||
filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
|
||||
is placed in a memory cgroup.
|
||||
|
||||
2. Wait until the workload accesses its working set.
|
||||
|
||||
3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If
|
||||
one wants to ignore certain types of pages, e.g. mlocked pages since they
|
||||
are not reclaimable, he or she can filter them out using /proc/kpageflags.
|
||||
3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
|
||||
If one wants to ignore certain types of pages, e.g. mlocked pages since they
|
||||
are not reclaimable, one can filter them out using
|
||||
``/proc/kpageflags``.
|
||||
|
||||
See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap,
|
||||
/proc/kpageflags, and /proc/kpagecgroup.
|
||||
See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
|
||||
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
|
||||
``/proc/kpagecgroup``.
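Step 3 above amounts to counting set bits in the data read back from the bitmap. A minimal sketch, assuming the buffer holds native-endian 8-byte words with bit #i%64 of word #i/64 standing for PFN #i; the function names are hypothetical, and filtering via ``/proc/kpageflags`` is not modelled here.

```python
import struct


def count_idle_bits(bitmap_bytes):
    """Count the bits set in a buffer read from the idle bitmap."""
    if len(bitmap_bytes) % 8:
        raise ValueError("buffer length must be a multiple of 8 bytes")
    words = struct.unpack("=%dQ" % (len(bitmap_bytes) // 8), bitmap_bytes)
    return sum(bin(word).count("1") for word in words)


def idle_pfns(bitmap_bytes, first_pfn=0):
    """Yield the PFNs whose bit is still set, i.e. pages that stayed idle."""
    words = struct.unpack("=%dQ" % (len(bitmap_bytes) // 8), bitmap_bytes)
    for index, word in enumerate(words):
        for bit in range(64):
            if (word >> bit) & 1:
                yield first_pfn + index * 64 + bit
```

``first_pfn`` is the PFN that corresponds to the first bit of the buffer, so a caller reading from a non-zero file offset can still recover absolute PFNs.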
|
||||
|
||||
IMPLEMENTATION DETAILS
|
||||
.. _impl_details:
|
||||
|
||||
Implementation Details
|
||||
======================
|
||||
|
||||
The kernel internally keeps track of accesses to user memory pages in order to
|
||||
reclaim unreferenced pages first on memory shortage conditions. A page is
|
||||
@ -77,7 +94,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
|
||||
exceeding the dirty memory limit, it is not marked referenced.
|
||||
|
||||
The idle memory tracking feature adds a new page flag, the Idle flag. This flag
|
||||
is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API
|
||||
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
|
||||
:ref:`User API <user_api>`
|
||||
section), and cleared automatically whenever a page is referenced as defined
|
||||
above.
|
||||
|
36
Documentation/admin-guide/mm/index.rst
Normal file
@ -0,0 +1,36 @@
|
||||
=================
|
||||
Memory Management
|
||||
=================
|
||||
|
||||
The Linux memory management subsystem is responsible, as the name implies,
|
||||
for managing the memory in the system. This includes implementation of
|
||||
virtual memory and demand paging, memory allocation both for kernel
|
||||
internal structures and user space programs, mapping of files into
|
||||
processes' address space and many other cool things.
|
||||
|
||||
Linux memory management is a complex system with many configurable
|
||||
settings. Most of these settings are available via ``/proc``
|
||||
filesystem and can be queried and adjusted using ``sysctl``. These APIs
|
||||
are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.
|
||||
|
||||
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
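As a small illustration of how a ``sysctl`` name relates to the ``/proc`` filesystem, the sketch below translates a dotted name such as ``vm.max_map_count`` into its file under ``/proc/sys`` and reads it. The helper names are hypothetical; the simple dot-to-slash mapping is assumed to be sufficient for ``vm.*`` settings.

```python
def sysctl_path(name, root="/proc/sys"):
    """Translate a dotted sysctl name into the matching path under root."""
    return root + "/" + name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl setting as a string."""
    with open(sysctl_path(name, root)) as f:
        return f.read().strip()
```

Writing a new value to the same file adjusts the setting, which normally requires root privileges.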
|
||||
|
||||
Linux memory management has its own jargon and if you are not yet
|
||||
familiar with it, consider reading
|
||||
:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
|
||||
|
||||
Here we document in detail how to interact with various mechanisms in
|
||||
the Linux memory management.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
concepts
|
||||
hugetlbpage
|
||||
idle_page_tracking
|
||||
ksm
|
||||
numa_memory_policy
|
||||
pagemap
|
||||
soft-dirty
|
||||
transhuge
|
||||
userfaultfd
|
189
Documentation/admin-guide/mm/ksm.rst
Normal file
@ -0,0 +1,189 @@
|
||||
.. _admin_guide_ksm:
|
||||
|
||||
=======================
|
||||
Kernel Samepage Merging
|
||||
=======================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
|
||||
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
|
||||
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
|
||||
|
||||
KSM was originally developed for use with KVM (where it was known as
|
||||
Kernel Shared Memory), to fit more virtual machines into physical memory,
|
||||
by sharing the data common between them. But it can be useful to any
|
||||
application which generates many instances of the same data.
|
||||
|
||||
The KSM daemon ksmd periodically scans those areas of user memory
|
||||
which have been registered with it, looking for pages of identical
|
||||
content which can be replaced by a single write-protected page (which
|
||||
is automatically copied if a process later wants to update its
|
||||
content). The amount of pages that KSM daemon scans in a single pass
|
||||
and the time between the passes are configured using :ref:`sysfs
|
||||
interface <ksm_sysfs>`.
|
||||
|
||||
KSM only merges anonymous (private) pages, never pagecache (file) pages.
|
||||
KSM's merged pages were originally locked into kernel memory, but can now
|
||||
be swapped out just like other user pages (but sharing is broken when they
|
||||
are swapped back in: ksmd must rediscover their identity and merge again).
|
||||
|
||||
Controlling KSM with madvise
|
||||
============================
|
||||
|
||||
KSM only operates on those areas of address space which an application
|
||||
has advised to be likely candidates for merging, by using the madvise(2)
|
||||
system call::
|
||||
|
||||
int madvise(addr, length, MADV_MERGEABLE)
|
||||
|
||||
The app may call
|
||||
|
||||
::
|
||||
|
||||
int madvise(addr, length, MADV_UNMERGEABLE)
|
||||
|
||||
to cancel that advice and restore unshared pages: whereupon KSM
|
||||
unmerges whatever it merged in that range. Note: this unmerging call
|
||||
may suddenly require more memory than is available - possibly failing
|
||||
with EAGAIN, but more probably arousing the Out-Of-Memory killer.
|
||||
|
||||
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
|
||||
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
|
||||
built with CONFIG_KSM=y, those calls will normally succeed: even if the
|
||||
KSM daemon is not currently running, MADV_MERGEABLE still registers
|
||||
the range for whenever the KSM daemon is started; even if the range
|
||||
cannot contain any pages which KSM could actually merge; even if
|
||||
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
|
||||
|
||||
If a region of memory must be split into at least one new MADV_MERGEABLE
|
||||
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
|
||||
will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).
|
||||
|
||||
Like other madvise calls, they are intended for use on mapped areas of
|
||||
the user address space: they will report ENOMEM if the specified range
|
||||
includes unmapped gaps (though working on the intervening mapped areas),
|
||||
and might fail with EAGAIN if not enough memory for internal structures.
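The registration described above can be tried from user space roughly as follows. This is a hedged sketch using Python's ``mmap`` module (``madvise`` and ``MADV_MERGEABLE`` are available there on Linux builds of Python 3.8 and later); the function name is hypothetical, and a failed call is simply reported as ``False`` rather than decoded into the individual errno values discussed above.

```python
import mmap


def advise_mergeable(length=16 * mmap.PAGESIZE):
    """Map private anonymous memory and try to register it with KSM.

    Returns (mapping, registered); registered is False when the madvise
    call is unavailable or fails.
    """
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    advice = getattr(mmap, "MADV_MERGEABLE", None)   # absent on non-Linux builds
    if advice is None or not hasattr(region, "madvise"):
        return region, False
    try:
        region.madvise(advice)
        return region, True
    except OSError:
        # For example EINVAL when the kernel was built without CONFIG_KSM.
        return region, False
```

The mapping is private and anonymous because KSM only merges anonymous pages; whether any pages are actually merged still depends on ksmd running and on the page contents.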
|
||||
|
||||
Applications should be considerate in their use of MADV_MERGEABLE,
|
||||
restricting its use to areas likely to benefit. KSM's scans may use a lot
|
||||
of processing power: some installations will disable KSM for that reason.
|
||||
|
||||
.. _ksm_sysfs:
|
||||
|
||||
KSM daemon sysfs interface
|
||||
==========================
|
||||
|
||||
The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
|
||||
readable by all but writable only by root:
|
||||
|
||||
pages_to_scan
|
||||
how many pages to scan before ksmd goes to sleep
|
||||
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
|
||||
|
||||
Default: 100 (chosen for demonstration purposes)
|
||||
|
||||
sleep_millisecs
|
||||
how many milliseconds ksmd should sleep before next scan
|
||||
e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
|
||||
|
||||
Default: 20 (chosen for demonstration purposes)
|
||||
|
||||
merge_across_nodes
|
||||
specifies if pages from different NUMA nodes can be merged.
|
||||
When set to 0, ksm merges only pages which physically reside
|
||||
in the memory area of same NUMA node. That brings lower
|
||||
latency to access of shared pages. Systems with more nodes, at
|
||||
significant NUMA distances, are likely to benefit from the
|
||||
lower latency of setting 0. Smaller systems, which need to
|
||||
minimize memory usage, are likely to benefit from the greater
|
||||
sharing of setting 1 (default). You may wish to compare how
|
||||
your system performs under each setting, before deciding on
|
||||
which to use. ``merge_across_nodes`` setting can be changed only
|
||||
when there are no ksm shared pages in the system: set ``run`` to 2 to
|
||||
unmerge pages first, then to 1 after changing
|
||||
``merge_across_nodes``, to remerge according to the new setting.
|
||||
|
||||
Default: 1 (merging across nodes as in earlier releases)
|
||||
|
||||
run
|
||||
* set to 0 to stop ksmd from running but keep merged pages,
|
||||
* set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
|
||||
* set to 2 to stop ksmd and unmerge all pages currently merged, but
|
||||
leave mergeable areas registered for next run.
|
||||
|
||||
Default: 0 (must be changed to 1 to activate KSM, except if
|
||||
CONFIG_SYSFS is disabled)
|
||||
|
||||
use_zero_pages
|
||||
specifies whether empty pages (i.e. allocated pages that only
|
||||
contain zeroes) should be treated specially. When set to 1,
|
||||
empty pages are merged with the kernel zero page(s) instead of
|
||||
with each other as it would happen normally. This can improve
|
||||
the performance on architectures with coloured zero pages,
|
||||
depending on the workload. Care should be taken when enabling
|
||||
this setting, as it can potentially degrade the performance of
|
||||
KSM for some workloads, for example if the checksums of pages
|
||||
candidate for merging match the checksum of an empty
|
||||
page. This setting can be changed at any time, it is only
|
||||
effective for pages merged after the change.
|
||||
|
||||
Default: 0 (normal KSM behaviour as in earlier releases)
|
||||
|
||||
max_page_sharing
|
||||
Maximum sharing allowed for each KSM page. This enforces a
|
||||
deduplication limit to avoid high latency for virtual memory
|
||||
operations that involve traversal of the virtual mappings that
|
||||
share the KSM page. The minimum value is 2 as a newly created
|
||||
KSM page will have at least two sharers. The higher this value
|
||||
the faster KSM will merge the memory and the higher the
|
||||
deduplication factor will be, but the slower the worst case
|
||||
virtual mappings traversal could be for any given KSM
|
||||
page. Slowing down this traversal means there will be higher
|
||||
latency for certain virtual memory operations happening during
|
||||
swapping, compaction, NUMA balancing and page migration, in
|
||||
turn decreasing responsiveness for the caller of those virtual
|
||||
memory operations. The scheduler latency of other tasks not
|
||||
involved with the VM operations doing the virtual mappings
|
||||
traversal is not affected by this parameter as these
|
||||
traversals are always schedule friendly themselves.
|
||||
|
||||
stable_node_chains_prune_millisecs
|
||||
specifies how frequently KSM checks the metadata of the pages
|
||||
that hit the deduplication limit for stale information.
|
||||
Smaller millisecs values will free up the KSM metadata with
|
||||
lower latency, but they will make ksmd use more CPU during the
|
||||
scan. It's a noop if not a single KSM page hit the
|
||||
``max_page_sharing`` yet.
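The tunables above are plain text files, so they can be read and written with ordinary file I/O. The sketch below is one possible way to do that, including the unmerge/change/remerge sequence described for ``merge_across_nodes``. The function names and the ``base`` parameter are hypothetical conveniences (the latter allows pointing the helpers at a scratch directory), and writing to the real files needs root.

```python
import os

KSM_DIR = "/sys/kernel/mm/ksm"


def read_ksm_files(base=KSM_DIR):
    """Return {name: int} for every file in base that holds a single integer."""
    values = {}
    for name in sorted(os.listdir(base)):
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue   # skip unreadable or non-numeric entries
    return values


def write_ksm_file(name, value, base=KSM_DIR):
    """Write one integer tunable."""
    with open(os.path.join(base, name), "w") as f:
        f.write("%d\n" % value)


def set_merge_across_nodes(value, base=KSM_DIR):
    """Unmerge, change merge_across_nodes, then start ksmd again."""
    write_ksm_file("run", 2, base)                    # stop ksmd and unmerge
    write_ksm_file("merge_across_nodes", value, base)
    write_ksm_file("run", 1, base)                    # remerge under the new setting
```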
|
||||
|
||||
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
|
||||
|
||||
pages_shared
|
||||
how many shared pages are being used
|
||||
pages_sharing
|
||||
how many more sites are sharing them i.e. how much saved
|
||||
pages_unshared
|
||||
how many pages unique but repeatedly checked for merging
|
||||
pages_volatile
|
||||
how many pages changing too fast to be placed in a tree
|
||||
full_scans
|
||||
how many times all mergeable areas have been scanned
|
||||
stable_node_chains
|
||||
the number of KSM pages that hit the ``max_page_sharing`` limit
|
||||
stable_node_dups
|
||||
number of duplicated KSM pages
|
||||
|
||||
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
|
||||
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
|
||||
indicates wasted effort. ``pages_volatile`` embraces several
|
||||
different kinds of activity, but a high proportion there would also
|
||||
indicate poor use of madvise MADV_MERGEABLE.
|
||||
|
||||
The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
|
||||
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
|
||||
be increased accordingly.
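The two ratios discussed above are simple to compute once the counters have been read. A minimal sketch follows; the function name is hypothetical, and the 4096-byte page size used for the savings estimate is an assumption that does not hold on every architecture.

```python
def ksm_ratios(pages_shared, pages_sharing, pages_unshared, page_size=4096):
    """Summarise KSM effectiveness from the sysfs counters.

    A ratio is None when its denominator is zero.
    """
    sharing = pages_sharing / pages_shared if pages_shared else None
    unshared = pages_unshared / pages_sharing if pages_sharing else None
    return {
        "sharing": sharing,        # high value: good sharing
        "unshared": unshared,      # high value: wasted scanning effort
        "saved_bytes": pages_sharing * page_size,
    }
```

For instance, 100 shared pages with 900 additional sharers give a sharing ratio of 9.0, which cannot exceed the ``max_page_sharing`` tunable.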
|
||||
|
||||
--
|
||||
Izik Eidus,
|
||||
Hugh Dickins, 17 Nov 2009
|
495
Documentation/admin-guide/mm/numa_memory_policy.rst
Normal file
@ -0,0 +1,495 @@
|
||||
.. _numa_memory_policy:
|
||||
|
||||
==================
|
||||
NUMA Memory Policy
|
||||
==================
|
||||
|
||||
What is NUMA Memory Policy?
|
||||
============================
|
||||
|
||||
In the Linux kernel, "memory policy" determines from which node the kernel will
|
||||
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
|
||||
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
|
||||
The current memory policy support was added to Linux 2.6 around May 2004. This
|
||||
document attempts to describe the concepts and APIs of the 2.6 memory policy
|
||||
support.
|
||||
|
||||
Memory policies should not be confused with cpusets
|
||||
(``Documentation/cgroup-v1/cpusets.txt``)
|
||||
which is an administrative mechanism for restricting the nodes from which
|
||||
memory may be allocated by a set of processes. Memory policies are a
|
||||
programming interface that a NUMA-aware application can take advantage of. When
|
||||
both cpusets and policies are applied to a task, the restrictions of the cpuset
|
||||
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
|
||||
below for more details.
|
||||
|
||||
Memory Policy Concepts
|
||||
======================
|
||||
|
||||
Scope of Memory Policies
|
||||
------------------------
|
||||
|
||||
The Linux kernel supports *scopes* of memory policy, described here from
|
||||
most general to most specific:
|
||||
|
||||
System Default Policy
|
||||
this policy is "hard coded" into the kernel. It is the policy
|
||||
that governs all page allocations that aren't controlled by
|
||||
one of the more specific policy scopes discussed below. When
|
||||
the system is "up and running", the system default policy will
|
||||
use "local allocation" described below. However, during boot
|
||||
up, the system default policy will be set to interleave
|
||||
allocations across all nodes with "sufficient" memory, so as
|
||||
not to overload the initial boot node with boot-time
|
||||
allocations.
|
||||
|
||||
Task/Process Policy
|
||||
this is an optional, per-task policy. When defined for a
|
||||
specific task, this policy controls all page allocations made
|
||||
by or on behalf of the task that aren't controlled by a more
|
||||
specific scope. If a task does not define a task policy, then
|
||||
all page allocations that would have been controlled by the
|
||||
task policy "fall back" to the System Default Policy.
|
||||
|
||||
The task policy applies to the entire address space of a task. Thus,
|
||||
it is inheritable, and indeed is inherited, across both fork()
|
||||
[clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
|
||||
to establish the task policy for a child task exec()'d from an
|
||||
executable image that has no awareness of memory policy. See the
|
||||
:ref:`Memory Policy APIs <memory_policy_apis>` section,
|
||||
below, for an overview of the system call
|
||||
that a task may use to set/change its task/process policy.
|
||||
|
||||
In a multi-threaded task, task policies apply only to the thread
|
||||
[Linux kernel task] that installs the policy and any threads
|
||||
subsequently created by that thread. Any sibling threads existing
|
||||
at the time a new task policy is installed retain their current
|
||||
policy.
|
||||
|
||||
A task policy applies only to pages allocated after the policy is
|
||||
installed. Any pages already faulted in by the task when the task
|
||||
changes its task policy remain where they were allocated based on
|
||||
the policy at the time they were allocated.
|
||||
|
||||
.. _vma_policy:
|
||||
|
||||
VMA Policy
|
||||
A "VMA" or "Virtual Memory Area" refers to a range of a task's
|
||||
virtual address space. A task may define a specific policy for a range
|
||||
of its virtual address space. See the
|
||||
:ref:`Memory Policy APIs <memory_policy_apis>` section,
|
||||
below, for an overview of the mbind() system call used to set a VMA
|
||||
policy.
|
||||
|
||||
A VMA policy will govern the allocation of pages that back
|
||||
this region of the address space. Any regions of the task's
|
||||
address space that don't have an explicit VMA policy will fall
|
||||
back to the task policy, which may itself fall back to the
|
||||
System Default Policy.
|
||||
|
||||
VMA policies have a few complicating details:
|
||||
|
||||
* VMA policy applies ONLY to anonymous pages. These include
|
||||
pages allocated for anonymous segments, such as the task
|
||||
stack and heap, and any regions of the address space
|
||||
mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
|
||||
applied to a file mapping, it will be ignored if the mapping
|
||||
used the MAP_SHARED flag. If the file mapping used the
|
||||
MAP_PRIVATE flag, the VMA policy will only be applied when
|
||||
an anonymous page is allocated on an attempt to write to the
|
||||
mapping-- i.e., at Copy-On-Write.
|
||||
|
||||
* VMA policies are shared between all tasks that share a
|
||||
virtual address space--a.k.a. threads--independent of when
|
||||
the policy is installed; and they are inherited across
|
||||
fork(). However, because VMA policies refer to a specific
|
||||
region of a task's address space, and because the address
|
||||
space is discarded and recreated on exec*(), VMA policies
|
||||
are NOT inheritable across exec(). Thus, only NUMA-aware
|
||||
applications may use VMA policies.
|
||||
|
||||
* A task may install a new VMA policy on a sub-range of a
|
||||
previously mmap()ed region. When this happens, Linux splits
|
||||
the existing virtual memory area into 2 or 3 VMAs, each with
|
||||
its own policy.
|
||||
|
||||
* By default, VMA policy applies only to pages allocated after
|
||||
the policy is installed. Any pages already faulted into the
|
||||
VMA range remain where they were allocated based on the
|
||||
policy at the time they were allocated. However, since
|
||||
2.6.16, Linux supports page migration via the mbind() system
|
||||
call, so that page contents can be moved to match a newly
|
||||
installed policy.
|
||||
|
||||
Shared Policy
|
||||
Conceptually, shared policies apply to "memory objects" mapped
|
||||
shared into one or more tasks' distinct address spaces. An
|
||||
application installs shared policies the same way as VMA
|
||||
policies--using the mbind() system call specifying a range of
|
||||
virtual addresses that map the shared object. However, unlike
|
||||
VMA policies, which can be considered to be an attribute of a
|
||||
range of a task's address space, shared policies apply
|
||||
directly to the shared object. Thus, all tasks that attach to
|
||||
the object share the policy, and all pages allocated for the
|
||||
shared object, by any task, will obey the shared policy.
|
||||
|
||||
As of 2.6.22, only shared memory segments, created by shmget() or
|
||||
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
|
||||
policy support was added to Linux, the associated data structures were
|
||||
added to hugetlbfs shmem segments. At the time, hugetlbfs did not
|
||||
support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
|
||||
shmem segments were never "hooked up" to the shared policy support.
|
||||
Although hugetlbfs segments now support lazy allocation, their support
|
||||
for shared policy has not been completed.
|
||||
|
||||
As mentioned above in :ref:`VMA policies <vma_policy>` section,
|
||||
allocations of page cache pages for regular files mmap()ed
|
||||
with MAP_SHARED ignore any VMA policy installed on the virtual
|
||||
address range backed by the shared file mapping. Rather,
|
||||
shared page cache pages, including pages backing private
|
||||
mappings that have not yet been written by the task, follow
|
||||
task policy, if any, else System Default Policy.
|
||||
|
||||
The shared policy infrastructure supports different policies on subset
|
||||
ranges of the shared object. However, Linux still splits the VMA of
|
||||
the task that installs the policy for each range of distinct policy.
|
||||
Thus, different tasks that attach to a shared memory segment can have
|
||||
different VMA configurations mapping that one shared object. This
|
||||
can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
|
||||
a shared memory region, when one task has installed shared policy on
|
||||
one or more ranges of the region.
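The fall-back order between the VMA, task and system default scopes can be pictured with a few lines of code. This is a deliberately simplified model: ``None`` stands for "no policy installed at this scope", the function name is hypothetical, and shared policies and page cache allocations, which follow the special rules described above, are not modelled.

```python
def effective_policy(vma_policy=None, task_policy=None,
                     system_default="local allocation"):
    """Return the policy that governs an allocation in this simplified model."""
    for policy in (vma_policy, task_policy):   # most specific scope first
        if policy is not None:
            return policy
    return system_default
```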
|
||||
|
||||
Components of Memory Policies
|
||||
-----------------------------
|
||||
|
||||
A NUMA memory policy consists of a "mode", optional mode flags, and
|
||||
an optional set of nodes. The mode determines the behavior of the
|
||||
policy, the optional mode flags determine the behavior of the mode,
|
||||
and the optional set of nodes can be viewed as the arguments to the
|
||||
policy behavior.
|
||||
|
||||
Internally, memory policies are implemented by a reference counted
|
||||
structure, struct mempolicy. Details of this structure will be
|
||||
discussed in context, below, as required to explain the behavior.
|
||||
|
||||
NUMA memory policy supports the following 4 behavioral modes:
|
||||
|
||||
Default Mode--MPOL_DEFAULT
|
||||
This mode is only used in the memory policy APIs. Internally,
|
||||
MPOL_DEFAULT is converted to the NULL memory policy in all
|
||||
policy scopes. Any existing non-default policy will simply be
|
||||
removed when MPOL_DEFAULT is specified. As a result,
|
||||
MPOL_DEFAULT means "fall back to the next most specific policy
|
||||
scope."
|
||||
|
||||
For example, a NULL or default task policy will fall back to the
|
||||
system default policy. A NULL or default vma policy will fall
|
||||
back to the task policy.
|
||||
|
||||
When specified in one of the memory policy APIs, the Default mode
|
||||
does not use the optional set of nodes.
|
||||
|
||||
It is an error for the set of nodes specified for this policy to
|
||||
be non-empty.
|
||||
|
||||
MPOL_BIND
|
||||
This mode specifies that memory must come from the set of
|
||||
nodes specified by the policy. Memory will be allocated from
|
||||
the node in the set with sufficient free memory that is
|
||||
closest to the node where the allocation takes place.
|
||||
|
||||
MPOL_PREFERRED
|
||||
This mode specifies that the allocation should be attempted
|
||||
from the single node specified in the policy. If that
|
||||
allocation fails, the kernel will search other nodes, in order
|
||||
of increasing distance from the preferred node based on
|
||||
information provided by the platform firmware.
|
||||
|
||||
Internally, the Preferred policy uses a single node--the
|
||||
preferred_node member of struct mempolicy. When the internal
|
||||
mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
|
||||
and the policy is interpreted as local allocation. "Local"
|
||||
allocation policy can be viewed as a Preferred policy that
|
||||
starts at the node containing the cpu where the allocation
|
||||
takes place.
|
||||
|
||||
It is possible for the user to specify that local allocation
|
||||
is always preferred by passing an empty nodemask with this
|
||||
mode. If an empty nodemask is passed, the policy cannot use
|
||||
the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
|
||||
described below.
|
||||
|
||||
MPOL_INTERLEAVE
|
||||
This mode specifies that page allocations be interleaved, on a
|
||||
page granularity, across the nodes specified in the policy.
|
||||
This mode also behaves slightly differently, based on the
|
||||
context where it is used:
|
||||
|
||||
For allocation of anonymous pages and shared memory pages,
|
||||
Interleave mode indexes the set of nodes specified by the
|
||||
policy using the page offset of the faulting address into the
|
||||
segment [VMA] containing the address modulo the number of
|
||||
nodes specified by the policy. It then attempts to allocate a
|
||||
page, starting at the selected node, as if the node had been
|
||||
specified by a Preferred policy or had been selected by a
|
||||
local allocation. That is, allocation will follow the per
|
||||
node zonelist.
|
||||
|
||||
For allocation of page cache pages, Interleave mode indexes
|
||||
the set of nodes specified by the policy using a node counter
|
||||
maintained per task. This counter wraps around to the lowest
|
||||
specified node after it reaches the highest specified node.
|
||||
This will tend to spread the pages out over the nodes
|
||||
specified by the policy based on the order in which they are
|
||||
allocated, rather than based on any page offset into an
|
||||
address range or file. During system boot up, the temporary
|
||||
interleaved system default policy works in this mode.
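The two ways Interleave mode picks a starting node can be sketched as below. The names are hypothetical and the model only selects the first node to try; the subsequent walk of the per-node zonelist when that node has no free memory is not shown.

```python
def interleave_node_for_offset(nodes, page_offset):
    """Anonymous/shared memory: index the node set by page offset modulo its size."""
    ordered = sorted(nodes)
    return ordered[page_offset % len(ordered)]


class TaskInterleaveCounter:
    """Page cache: a per-task counter that wraps from the highest node to the lowest."""

    def __init__(self, nodes):
        self._ordered = sorted(nodes)
        self._next = 0

    def next_node(self):
        node = self._ordered[self._next]
        self._next = (self._next + 1) % len(self._ordered)
        return node
```

With nodes 1, 3 and 5, page offsets 0 through 4 select nodes 1, 3, 5, 1, 3, while the counter hands out nodes in allocation order regardless of any offset.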
|
||||
|
||||
NUMA memory policy supports the following optional mode flags:
|
||||
|
||||
MPOL_F_STATIC_NODES
|
||||
This flag specifies that the nodemask passed by
|
||||
the user should not be remapped if the task or VMA's set of allowed
|
||||
nodes changes after the memory policy has been defined.
|
||||
|
||||
Without this flag, any time a mempolicy is rebound because of a
|
||||
change in the set of allowed nodes, the node (Preferred) or
|
||||
nodemask (Bind, Interleave) is remapped to the new set of
|
||||
allowed nodes. This may result in nodes being used that were
|
||||
previously undesired.
|
||||
|
||||
With this flag, if the user-specified nodes overlap with the
|
||||
nodes allowed by the task's cpuset, then the memory policy is
|
||||
applied to their intersection. If the two sets of nodes do not
|
||||
overlap, the Default policy is used.
|
||||
|
||||
For example, consider a task that is attached to a cpuset with
|
||||
mems 1-3 that sets an Interleave policy over the same set. If
|
||||
the cpuset's mems change to 3-5, the Interleave will now occur
|
||||
over nodes 3, 4, and 5. With this flag, however, since only node
|
||||
3 is allowed from the user's nodemask, the "interleave" only
|
||||
occurs over that node. If no nodes from the user's nodemask are
|
||||
now allowed, the Default behavior is used.
|
||||
|
||||
MPOL_F_STATIC_NODES cannot be combined with the
|
||||
MPOL_F_RELATIVE_NODES flag. It also cannot be used for
|
||||
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||
(local allocation).
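The effect of MPOL_F_STATIC_NODES on a rebind can be expressed as a set intersection. A minimal sketch, with a hypothetical function name, in which ``None`` stands for "fall back to the Default behavior":

```python
def rebind_static_nodes(user_nodes, allowed_nodes):
    """Nodes a policy with MPOL_F_STATIC_NODES uses after the allowed set changes."""
    effective = set(user_nodes) & set(allowed_nodes)
    return effective or None   # empty intersection: Default behavior
```

Using the example above, a user nodemask of 1-3 with the cpuset's mems changed to 3-5 leaves only node 3.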
|
||||
|
||||
MPOL_F_RELATIVE_NODES
|
||||
This flag specifies that the nodemask passed
|
||||
by the user will be mapped relative to the task or VMA's
|
||||
set of allowed nodes. The kernel stores the user-passed nodemask,
|
||||
and if the set of allowed nodes changes, then that original nodemask will
|
||||
be remapped relative to the new set of allowed nodes.
|
||||
|
||||
Without this flag (and without MPOL_F_STATIC_NODES), anytime a
|
||||
mempolicy is rebound because of a change in the set of allowed
|
||||
nodes, the node (Preferred) or nodemask (Bind, Interleave) is
|
||||
remapped to the new set of allowed nodes. That remap may not
|
||||
preserve the relative nature of the user's passed nodemask to its
|
||||
set of allowed nodes upon successive rebinds: a nodemask of
|
||||
1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
|
||||
allowed nodes is restored to its original state.
|
||||
|
||||
With this flag, the remap is done so that the node numbers from
|
||||
the user's passed nodemask are relative to the set of allowed
|
||||
nodes. In other words, if nodes 0, 2, and 4 are set in the user's
|
||||
nodemask, the policy will be effected over the first (and in the
|
||||
Bind or Interleave case, the third and fifth) nodes in the set of
|
||||
allowed nodes. The nodemask passed by the user represents nodes
|
||||
relative to task or VMA's set of allowed nodes.
|
||||
|
||||
If the user's nodemask includes nodes that are outside the range
|
||||
of the new set of allowed nodes (for example, node 5 is set in
|
||||
the user's nodemask when the set of allowed nodes is only 0-3),
|
||||
then the remap wraps around to the beginning of the nodemask and,
|
||||
if not already set, sets the node in the mempolicy nodemask.
|
||||
|
||||
For example, consider a task that is attached to a cpuset with
|
||||
mems 2-5 that sets an Interleave policy over the same set with
|
||||
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
|
||||
interleave now occurs over nodes 3,5-7. If the cpuset's mems
|
||||
then change to 0,2-3,5, then the interleave occurs over nodes
|
||||
0,2-3,5.
|
||||
|
||||
Thanks to the consistent remapping, applications preparing
|
||||
nodemasks to specify memory policies using this flag should
|
||||
disregard their current, actual cpuset imposed memory placement
|
||||
and prepare the nodemask as if they were always located on
|
||||
memory nodes 0 to N-1, where N is the number of memory nodes the
|
||||
policy is intended to manage. Let the kernel then remap to the
|
||||
set of memory nodes allowed by the task's cpuset, as that may
|
||||
change over time.
|
||||
|
||||
MPOL_F_RELATIVE_NODES cannot be combined with the
|
||||
MPOL_F_STATIC_NODES flag. It also cannot be used for
|
||||
MPOL_PREFERRED policies that were created with an empty nodemask
|
||||
(local allocation).
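The relative remapping, including the wrap-around for node numbers beyond the size of the allowed set, can be sketched as follows. The function name is hypothetical; the model treats node number n in the user's nodemask as selecting the (n mod N)-th allowed node, where N is the number of allowed nodes, which reproduces the cpuset example above.

```python
def relative_nodemask(user_nodes, allowed_nodes):
    """Map a user nodemask onto allowed_nodes, treating it as relative."""
    allowed = sorted(allowed_nodes)
    if not allowed:
        return set()
    # Out-of-range node numbers wrap around to the start of the allowed set.
    return {allowed[node % len(allowed)] for node in user_nodes}
```

Because the kernel keeps the original user nodemask, each call starts from the same ``user_nodes`` rather than from the result of the previous remap.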
|
||||
|
||||
Memory Policy Reference Counting
|
||||
================================
|
||||
|
||||
To resolve use/free races, struct mempolicy contains an atomic reference
|
||||
count field. Internal interfaces, mpol_get()/mpol_put() increment and
|
||||
decrement this reference count, respectively. mpol_put() will only free
|
||||
the structure back to the mempolicy kmem cache when the reference count
|
||||
goes to zero.
|
||||
|
||||
When a new memory policy is allocated, its reference count is initialized
|
||||
to '1', representing the reference held by the task that is installing the
|
||||
new policy. When a pointer to a memory policy structure is stored in another
|
||||
structure, another reference is added, as the task's reference will be dropped
|
||||
on completion of the policy installation.
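The life cycle of the reference count during installation can be followed with a toy model. This is purely illustrative: the class is hypothetical, the counter is not atomic, and it only mimics the counting behaviour attributed to mpol_get()/mpol_put() above.

```python
class ToyMempolicy:
    """Toy stand-in for the reference count in struct mempolicy."""

    def __init__(self):
        self.refcount = 1      # reference held by the installing task
        self.freed = False

    def get(self):
        """Take an extra reference, as mpol_get() is described to do."""
        self.refcount += 1

    def put(self):
        """Drop a reference; the policy is freed when the count reaches zero."""
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True
```

Storing the policy pointer in another structure corresponds to one ``get()``, and the installing task's ``put()`` on completion leaves exactly that stored reference outstanding.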
|
||||
|
||||
During run-time "usage" of the policy, we attempt to minimize atomic operations
|
||||
on the reference count, as this can lead to cache lines bouncing between cpus
|
||||
and NUMA nodes. "Usage" here means one of the following:
|
||||
|
||||
1) querying of the policy, either by the task itself [using the get_mempolicy()
|
||||
API discussed below] or by another task using the /proc/<pid>/numa_maps
|
||||
interface.
|
||||
|
||||
2) examination of the policy to determine the policy mode and associated node
|
||||
or node lists, if any, for page allocation. This is considered a "hot
|
||||
path". Note that for MPOL_BIND, the "usage" extends across the entire
|
||||
allocation process, which may sleep during page reclamation, because the
|
||||
BIND policy nodemask is used, by reference, to filter ineligible nodes.
|
||||
|
||||
We can avoid taking an extra reference during the usages listed above as
|
||||
follows:
|
||||
|
||||
1) we never need to get/free the system default policy as this is never
|
||||
changed nor freed, once the system is up and running.
|
||||
|
||||
2) for querying the policy, we do not need to take an extra reference on the
|
||||
target task's task policy nor vma policies because we always acquire the
|
||||
task's mm's mmap_sem for read during the query. The set_mempolicy() and
|
||||
mbind() APIs [see below] always acquire the mmap_sem for write when
|
||||
installing or replacing task or vma policies. Thus, there is no possibility
|
||||
of a task or thread freeing a policy while another task or thread is
|
||||
querying it.
|
||||
|
||||
3) Page allocation usage of task or vma policy occurs in the fault path where
|
||||
we hold the mmap_sem for read. Again, because replacing the task or vma
|
||||
policy requires that the mmap_sem be held for write, the policy can't be
|
||||
freed out from under us while we're using it for page allocation.
|
||||
|
||||
4) Shared policies require special consideration. One task can replace a
|
||||
shared memory policy while another task, with a distinct mmap_sem, is
|
||||
querying or allocating a page based on the policy. To resolve this
|
||||
potential race, the shared policy infrastructure adds an extra reference
|
||||
to the shared policy during lookup while holding a spin lock on the shared
|
||||
policy management structure. This requires that we drop this extra
|
||||
reference when we're finished "using" the policy. We must drop the
|
||||
extra reference on shared policies in the same query/allocation paths
|
||||
used for non-shared policies. For this reason, shared policies are marked
|
||||
as such, and the extra reference is dropped "conditionally"--i.e., only
|
||||
for shared policies.
|
||||
|
||||
Because of this extra reference counting, and because we must look up
|
||||
shared policies in a tree structure under spinlock, shared policies are
|
||||
more expensive to use in the page allocation path. This is especially
|
||||
true for shared policies on shared memory regions shared by tasks running
|
||||
on different NUMA nodes. This extra overhead can be avoided by always
|
||||
falling back to task or system default policy for shared memory regions,
|
||||
or by prefaulting the entire shared memory region into memory and locking
|
||||
it down. However, this might not be appropriate for all applications.
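The conditional reference drop described in item 4 can be modelled with a toy sketch. This is not kernel code: ``Policy``, ``SharedPolicyStore`` and ``cond_put`` are hypothetical names, a ``threading.Lock`` stands in for the spin lock, and a dict stands in for the tree of shared policies.

```python
import threading


class Policy:
    """Toy stand-in for a memory policy object (not the kernel structure)."""

    def __init__(self, name, shared=False):
        self.name = name
        self.shared = shared   # plays the role of the "marked as shared" flag
        self.refcount = 1      # reference held by whoever installed the policy


class SharedPolicyStore:
    """Toy stand-in for the shared policy management structure."""

    def __init__(self):
        self._lock = threading.Lock()   # models the spin lock
        self._policies = {}             # models the tree, keyed by range start

    def install(self, start, policy):
        with self._lock:
            self._policies[start] = policy

    def lookup(self, start):
        # Take the extra reference while the lock is held, so that a
        # concurrent replace cannot free the policy under the caller.
        with self._lock:
            policy = self._policies.get(start)
            if policy is not None:
                policy.refcount += 1
            return policy


def cond_put(policy):
    """Drop the extra reference, but only for shared policies."""
    if policy is not None and policy.shared:
        policy.refcount -= 1
```

The same ``cond_put()`` call can sit at the end of a query or allocation path regardless of where the policy came from: for task and vma policies it is a no-op, which is the point of marking shared policies as such.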
|
||||
|
||||
.. _memory_policy_apis:
|
||||
|
||||
Memory Policy APIs
|
||||
==================
|
||||
|
||||
Linux supports 3 system calls for controlling memory policy. These APIs
|
||||
always affect only the calling task, the calling task's address space, or
|
||||
some shared object mapped into the calling task's address space.
|
||||
|
||||
.. note::
|
||||
the headers that define these APIs and the parameter data types for
|
||||
user space applications reside in a package that is not part of the
|
||||
Linux kernel. The kernel system call interfaces, with the 'sys\_'
|
||||
prefix, are defined in <linux/syscalls.h>; the mode and flag
|
||||
definitions are defined in <linux/mempolicy.h>.
|
||||
|
||||
Set [Task] Memory Policy::
|
||||
|
||||
long set_mempolicy(int mode, const unsigned long *nmask,
|
||||
unsigned long maxnode);
|
||||
|
||||
Sets the calling task's "task/process memory policy" to mode
|
||||
specified by the 'mode' argument and the set of nodes defined by
|
||||
'nmask'. 'nmask' points to a bit mask of node ids containing at least
|
||||
'maxnode' ids. Optional mode flags may be passed by combining the
|
||||
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
|
||||
MPOL_F_STATIC_NODES).
|
||||
|
||||
See the set_mempolicy(2) man page for more details.
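As a rough illustration of how the 'mode' and 'nmask' arguments are put together, here is a minimal sketch that only builds the values and does not issue the system call. The numeric constants are what we believe the kernel's uapi <linux/mempolicy.h> defines; check them against your headers. ``build_nodemask`` and ``mode_with_flags`` are hypothetical helper names, and a 64-bit unsigned long is assumed.

```python
# Values believed to match the kernel's uapi <linux/mempolicy.h>;
# verify against the headers on the target system.
MPOL_DEFAULT = 0
MPOL_PREFERRED = 1
MPOL_BIND = 2
MPOL_INTERLEAVE = 3
MPOL_F_STATIC_NODES = 1 << 15
MPOL_F_RELATIVE_NODES = 1 << 14

BITS_PER_LONG = 64  # assumption: 64-bit unsigned long


def build_nodemask(nodes, maxnode):
    """Return the unsigned-long words of a bit mask holding `maxnode` bits."""
    nwords = (maxnode + BITS_PER_LONG - 1) // BITS_PER_LONG
    words = [0] * nwords
    for node in nodes:
        if not 0 <= node < maxnode:
            raise ValueError(f"node {node} does not fit in {maxnode} bits")
        words[node // BITS_PER_LONG] |= 1 << (node % BITS_PER_LONG)
    return words


def mode_with_flags(mode, flags=0):
    """Combine a policy mode with optional mode flags, as 'mode' expects."""
    return mode | flags
```

For example, ``mode_with_flags(MPOL_INTERLEAVE, MPOL_F_STATIC_NODES)`` together with ``build_nodemask([0, 2], 64)`` describes interleaving over nodes 0 and 2 with static node semantics.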
|
||||
|
||||
|
||||
Get [Task] Memory Policy or Related Information::
|
||||
|
||||
long get_mempolicy(int *mode,
|
||||
const unsigned long *nmask, unsigned long maxnode,
|
||||
void *addr, int flags);
|
||||
|
||||
Queries the "task/process memory policy" of the calling task, or the
|
||||
policy or location of a specified virtual address, depending on the
|
||||
'flags' argument.
|
||||
|
||||
See the get_mempolicy(2) man page for more details.
|
||||
|
||||
|
||||
Install VMA/Shared Policy for a Range of Task's Address Space::
|
||||
|
||||
long mbind(void *start, unsigned long len, int mode,
|
||||
const unsigned long *nmask, unsigned long maxnode,
|
||||
unsigned flags);
|
||||
|
||||
mbind() installs the policy specified by (mode, nmask, maxnode) as a
|
||||
VMA policy for the range of the calling task's address space specified
|
||||
by the 'start' and 'len' arguments. Additional actions may be
|
||||
requested via the 'flags' argument.
|
||||
|
||||
See the mbind(2) man page for more details.
|
||||
|
||||
Memory Policy Command Line Interface
|
||||
====================================
|
||||
|
||||
Although not strictly part of the Linux implementation of memory policy,
|
||||
a command line tool, numactl(8), exists that allows one to:
|
||||
|
||||
+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
|
||||
exec(2)
|
||||
|
||||
+ set the shared policy for a shared memory segment via mbind(2)
|
||||
|
||||
The numactl(8) tool is packaged with the run-time version of the library
|
||||
containing the memory policy system call wrappers. Some distributions
|
||||
package the headers and compile-time libraries in a separate development
|
||||
package.
|
||||
|
||||
.. _mem_pol_and_cpusets:
|
||||
|
||||
Memory Policies and cpusets
|
||||
===========================
|
||||
|
||||
Memory policies work within cpusets as described above. For memory policies
|
||||
that require a node or set of nodes, the nodes are restricted to the set of
|
||||
nodes whose memories are allowed by the cpuset constraints. If the nodemask
|
||||
specified for the policy contains nodes that are not allowed by the cpuset and
|
||||
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
|
||||
specified for the policy and the set of nodes with memory is used. If the
|
||||
result is the empty set, the policy is considered invalid and cannot be
|
||||
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
|
||||
onto and folded into the task's set of allowed nodes as previously described.
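The two cases above can be sketched with plain sets. This is an illustration under assumptions, not the kernel's implementation: ``effective_nodes`` is a hypothetical name, ``allowed_nodes`` stands for whatever set of nodes the policy is restricted to, and the relative case assumes that "folded and mapped onto" means relative node n lands on the (n mod N)-th allowed node, where N is the number of allowed nodes.

```python
def effective_nodes(policy_nodes, allowed_nodes, relative=False):
    """Return the nodes a policy would end up using, or raise if none remain."""
    allowed = sorted(set(allowed_nodes))
    if relative:
        # Assumed reading of MPOL_F_RELATIVE_NODES: fold the relative node
        # ids modulo the number of allowed nodes, then map each folded id
        # onto the allowed node at that position.
        if allowed:
            result = {allowed[n % len(allowed)] for n in policy_nodes}
        else:
            result = set()
    else:
        # Without MPOL_F_RELATIVE_NODES: plain intersection.
        result = set(policy_nodes) & set(allowed)
    if not result:
        raise ValueError("empty node set: policy cannot be installed")
    return result
```

With allowed nodes {2, 4, 6}, a policy naming nodes {0, 1, 2} keeps only node 2 in the intersection case, while the relative node set {0, 1, 3} maps to {2, 4} under the folding assumption.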
|
||||
|
||||
The interaction of memory policies and cpusets can be problematic when tasks
|
||||
in two cpusets share access to a memory region, such as shared memory segments
|
||||
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
|
||||
any of the tasks installs shared policy on the region: only nodes whose
|
||||
memories are allowed in both cpusets may be used in the policies. Obtaining
|
||||
this information requires "stepping outside" the memory policy APIs to use the
|
||||
cpuset information and requires that one know in what cpusets other tasks might
|
||||
be attaching to the shared region. Furthermore, if the cpusets' allowed
|
||||
memory sets are disjoint, "local" allocation is the only valid policy.
|
@ -1,21 +1,25 @@
|
||||
pagemap, from the userspace perspective
|
||||
---------------------------------------
|
||||
.. _pagemap:
|
||||
|
||||
=============================
|
||||
Examining Process Page Tables
|
||||
=============================
|
||||
|
||||
pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
|
||||
userspace programs to examine the page tables and related information by
|
||||
reading files in /proc.
|
||||
reading files in ``/proc``.
|
||||
|
||||
There are four components to pagemap:
|
||||
|
||||
* /proc/pid/pagemap. This file lets a userspace process find out which
|
||||
* ``/proc/pid/pagemap``. This file lets a userspace process find out which
|
||||
physical frame each virtual page is mapped to. It contains one 64-bit
|
||||
value for each virtual page, containing the following data (from
|
||||
fs/proc/task_mmu.c, above pagemap_read):
|
||||
``fs/proc/task_mmu.c``, above pagemap_read):
|
||||
|
||||
* Bits 0-54 page frame number (PFN) if present
|
||||
* Bits 0-4 swap type if swapped
|
||||
* Bits 5-54 swap offset if swapped
|
||||
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
|
||||
* Bit 55 pte is soft-dirty (see
|
||||
:ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
|
||||
* Bit 56 page exclusively mapped (since 4.2)
|
||||
* Bits 57-60 zero
|
||||
* Bit 61 page is file-page or shared-anon (since 3.5)
|
||||
@ -33,28 +37,28 @@ There are four components to pagemap:
|
||||
precisely which pages are mapped (or in swap) and comparing mapped
|
||||
pages between processes.
|
||||
|
||||
Efficient users of this interface will use /proc/pid/maps to
|
||||
Efficient users of this interface will use ``/proc/pid/maps`` to
|
||||
determine which areas of memory are actually mapped and llseek to
|
||||
skip over unmapped regions.
|
||||
|
||||
* /proc/kpagecount. This file contains a 64-bit count of the number of
|
||||
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
|
||||
times each page is mapped, indexed by PFN.
|
||||
|
||||
* /proc/kpageflags. This file contains a 64-bit set of flags for each
|
||||
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
|
||||
page, indexed by PFN.
|
||||
|
||||
The flags are (from fs/proc/page.c, above kpageflags_read):
|
||||
The flags are (from ``fs/proc/page.c``, above kpageflags_read):
|
||||
|
||||
0. LOCKED
|
||||
1. ERROR
|
||||
2. REFERENCED
|
||||
3. UPTODATE
|
||||
4. DIRTY
|
||||
5. LRU
|
||||
6. ACTIVE
|
||||
7. SLAB
|
||||
8. WRITEBACK
|
||||
9. RECLAIM
|
||||
0. LOCKED
|
||||
1. ERROR
|
||||
2. REFERENCED
|
||||
3. UPTODATE
|
||||
4. DIRTY
|
||||
5. LRU
|
||||
6. ACTIVE
|
||||
7. SLAB
|
||||
8. WRITEBACK
|
||||
9. RECLAIM
|
||||
10. BUDDY
|
||||
11. MMAP
|
||||
12. ANON
|
||||
@ -72,98 +76,111 @@ There are four components to pagemap:
|
||||
24. ZERO_PAGE
|
||||
25. IDLE
|
||||
|
||||
* /proc/kpagecgroup. This file contains a 64-bit inode number of the
|
||||
* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
|
||||
memory cgroup each page is charged to, indexed by PFN. Only available when
|
||||
CONFIG_MEMCG is set.
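To make the layout of a ``/proc/pid/pagemap`` entry concrete, here is a minimal decoding sketch. ``decode_pagemap_entry`` is a hypothetical helper name. Bits 62 (page swapped) and 63 (page present) fall in the part of the bit list elided from this hunk; their meaning here is taken from the full kernel document and should be checked against your kernel version.

```python
def decode_pagemap_entry(entry):
    """Split one 64-bit pagemap value into the fields listed above."""
    present = bool((entry >> 63) & 1)   # bit 63, per the full bit list
    swapped = bool((entry >> 62) & 1)   # bit 62, per the full bit list
    info = {
        "present": present,
        "swapped": swapped,
        "file_or_shared_anon": bool((entry >> 61) & 1),
        "exclusive": bool((entry >> 56) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
    }
    low = entry & ((1 << 55) - 1)       # bits 0-54
    if present:
        info["pfn"] = low
    elif swapped:
        info["swap_type"] = low & 0x1F  # bits 0-4
        info["swap_offset"] = low >> 5  # bits 5-54
    return info
```

Bits 0-54 are interpreted as a PFN only when the page is present and as swap type and offset only when it is swapped, which is why the sketch leaves those keys out otherwise.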
|
||||
|
||||
Short descriptions to the page flags:
|
||||
Short descriptions of the page flags
|
||||
====================================
|
||||
|
||||
0. LOCKED
|
||||
page is being locked for exclusive access, eg. by undergoing read/write IO
|
||||
|
||||
7. SLAB
|
||||
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
|
||||
When compound page is used, SLUB/SLQB will only set this flag on the head
|
||||
page; SLOB will not flag it at all.
|
||||
|
||||
10. BUDDY
|
||||
0 - LOCKED
|
||||
page is being locked for exclusive access, e.g. by undergoing read/write IO
|
||||
7 - SLAB
|
||||
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
|
||||
When compound page is used, SLUB/SLQB will only set this flag on the head
|
||||
page; SLOB will not flag it at all.
|
||||
10 - BUDDY
|
||||
a free memory block managed by the buddy system allocator
|
||||
The buddy system organizes free memory in blocks of various orders.
|
||||
An order N block has 2^N physically contiguous pages, with the BUDDY flag
|
||||
set for and _only_ for the first page.
|
||||
|
||||
15. COMPOUND_HEAD
|
||||
16. COMPOUND_TAIL
|
||||
15 - COMPOUND_HEAD
|
||||
A compound page with order N consists of 2^N physically contiguous pages.
|
||||
A compound page with order 2 takes the form of "HTTT", where H denotes its
|
||||
head page and T denotes its tail page(s). The major consumers of compound
|
||||
pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
|
||||
memory allocators and various device drivers. However in this interface,
|
||||
only huge/giga pages are made visible to end users.
|
||||
17. HUGE
|
||||
pages are hugeTLB pages
|
||||
(:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
|
||||
the SLUB etc. memory allocators and various device drivers.
|
||||
However in this interface, only huge/giga pages are made visible
|
||||
to end users.
|
||||
16 - COMPOUND_TAIL
|
||||
A compound page tail (see description above).
|
||||
17 - HUGE
|
||||
this is an integral part of a HugeTLB page
|
||||
|
||||
19. HWPOISON
|
||||
19 - HWPOISON
|
||||
hardware detected memory corruption on this page: don't touch the data!
|
||||
|
||||
20. NOPAGE
|
||||
20 - NOPAGE
|
||||
no page frame exists at the requested address
|
||||
|
||||
21. KSM
|
||||
21 - KSM
|
||||
identical memory pages dynamically shared between one or more processes
|
||||
|
||||
22. THP
|
||||
22 - THP
|
||||
contiguous pages which construct transparent hugepages
|
||||
|
||||
23. BALLOON
|
||||
23 - BALLOON
|
||||
balloon compaction page
|
||||
|
||||
24. ZERO_PAGE
|
||||
24 - ZERO_PAGE
|
||||
zero page for pfn_zero or huge_zero page
|
||||
|
||||
25. IDLE
|
||||
25 - IDLE
|
||||
page has not been accessed since it was marked idle (see
|
||||
Documentation/vm/idle_page_tracking.txt). Note that this flag may be
|
||||
stale in case the page was accessed via a PTE. To make sure the flag
|
||||
is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first.
|
||||
:ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
|
||||
Note that this flag may be stale in case the page was accessed via
|
||||
a PTE. To make sure the flag is up-to-date one has to read
|
||||
``/sys/kernel/mm/page_idle/bitmap`` first.
|
||||
|
||||
[IO related page flags]
|
||||
1. ERROR IO error occurred
|
||||
3. UPTODATE page has up-to-date data
|
||||
ie. for file backed page: (in-memory data revision >= on-disk one)
|
||||
4. DIRTY page has been written to, hence contains new data
|
||||
ie. for file backed page: (in-memory data revision > on-disk one)
|
||||
8. WRITEBACK page is being synced to disk
|
||||
IO related page flags
|
||||
---------------------
|
||||
|
||||
[LRU related page flags]
|
||||
5. LRU page is in one of the LRU lists
|
||||
6. ACTIVE page is in the active LRU list
|
||||
18. UNEVICTABLE page is in the unevictable (non-)LRU list
|
||||
It is somehow pinned and not a candidate for LRU page reclaims,
|
||||
eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
|
||||
2. REFERENCED page has been referenced since last LRU list enqueue/requeue
|
||||
9. RECLAIM page will be reclaimed soon after its pageout IO completed
|
||||
11. MMAP a memory mapped page
|
||||
12. ANON a memory mapped page that is not part of a file
|
||||
13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry
|
||||
14. SWAPBACKED page is backed by swap/RAM
|
||||
1 - ERROR
|
||||
IO error occurred
|
||||
3 - UPTODATE
|
||||
page has up-to-date data
|
||||
i.e. for file backed page: (in-memory data revision >= on-disk one)
|
||||
4 - DIRTY
|
||||
page has been written to, hence contains new data
|
||||
i.e. for file backed page: (in-memory data revision > on-disk one)
|
||||
8 - WRITEBACK
|
||||
page is being synced to disk
|
||||
|
||||
LRU related page flags
|
||||
----------------------
|
||||
|
||||
5 - LRU
|
||||
page is in one of the LRU lists
|
||||
6 - ACTIVE
|
||||
page is in the active LRU list
|
||||
18 - UNEVICTABLE
|
||||
page is in the unevictable (non-)LRU list. It is somehow pinned and
|
||||
not a candidate for LRU page reclaims, e.g. ramfs pages,
|
||||
shmctl(SHM_LOCK) and mlock() memory segments
|
||||
2 - REFERENCED
|
||||
page has been referenced since last LRU list enqueue/requeue
|
||||
9 - RECLAIM
|
||||
page will be reclaimed soon after its pageout IO completed
|
||||
11 - MMAP
|
||||
a memory mapped page
|
||||
12 - ANON
|
||||
a memory mapped page that is not part of a file
|
||||
13 - SWAPCACHE
|
||||
page is mapped to swap space, i.e. has an associated swap entry
|
||||
14 - SWAPBACKED
|
||||
page is backed by swap/RAM
|
||||
|
||||
The page-types tool in the tools/vm directory can be used to query the
|
||||
above flags.
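For quick experiments without page-types, a 64-bit value read from ``/proc/kpageflags`` can be turned into flag names with a small lookup. This is a minimal sketch; ``KPF_NAMES`` and ``flag_names`` are hypothetical names, and the table simply follows the bit numbers used in the descriptions above.

```python
# Index in this list == bit number in the kpageflags value.
KPF_NAMES = [
    "LOCKED", "ERROR", "REFERENCED", "UPTODATE", "DIRTY", "LRU", "ACTIVE",
    "SLAB", "WRITEBACK", "RECLAIM", "BUDDY", "MMAP", "ANON", "SWAPCACHE",
    "SWAPBACKED", "COMPOUND_HEAD", "COMPOUND_TAIL", "HUGE", "UNEVICTABLE",
    "HWPOISON", "NOPAGE", "KSM", "THP", "BALLOON", "ZERO_PAGE", "IDLE",
]


def flag_names(value):
    """Return the names of all documented flags set in a kpageflags value."""
    return [name for bit, name in enumerate(KPF_NAMES) if (value >> bit) & 1]
```

Bits above 25 are ignored by this sketch, so flags introduced by other kernel versions will not show up in the result.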
|
||||
|
||||
Using pagemap to do something useful:
|
||||
Using pagemap to do something useful
|
||||
====================================
|
||||
|
||||
The general procedure for using pagemap to find out about a process' memory
|
||||
usage goes like this:
|
||||
|
||||
1. Read /proc/pid/maps to determine which parts of the memory space are
|
||||
1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
|
||||
mapped to what.
|
||||
2. Select the maps you are interested in -- all of them, or a particular
|
||||
library, or the stack or the heap, etc.
|
||||
3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
|
||||
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
|
||||
4. Read a u64 for each page from pagemap.
|
||||
5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just
|
||||
read, seek to that entry in the file, and read the data you want.
|
||||
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
|
||||
just read, seek to that entry in the file, and read the data you want.
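Steps 1 to 4 can be sketched as follows. The helper names are hypothetical, a 4K page size is assumed as the default, and entries are unpacked in native byte order. ``read_pagemap_entries`` works on any seekable binary file object, so it can be tried on an in-memory buffer before pointing it at a real ``/proc/pid/pagemap``.

```python
import struct


def parse_maps_line(line):
    """Return (start, end) virtual addresses from one /proc/pid/maps line."""
    addr_range = line.split()[0]
    start, end = (int(part, 16) for part in addr_range.split("-"))
    return start, end


def pagemap_offset(vaddr, page_size=4096):
    """Byte offset in pagemap of the entry for the page containing vaddr."""
    return (vaddr // page_size) * 8


def read_pagemap_entries(pagemap_file, start, end, page_size=4096):
    """Read one u64 per page for the virtual range [start, end)."""
    pagemap_file.seek(pagemap_offset(start, page_size))
    count = (end - start) // page_size
    data = pagemap_file.read(count * 8)
    usable = len(data) - len(data) % 8   # ignore a trailing partial entry
    return [value for (value,) in struct.iter_unpack("=Q", data[:usable])]
```

Seeking to ``pagemap_offset(start)`` keeps every read on an 8-byte boundary, and the same offset arithmetic (PFN times 8) applies when looking a PFN up in ``/proc/kpagecount`` or ``/proc/kpageflags``.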
|
||||
|
||||
For example, to find the "unique set size" (USS), which is the amount of
|
||||
memory that a process is using that is not shared with any other process,
|
||||
@ -171,7 +188,8 @@ you can go through every map in the process, find the PFNs, look those up
|
||||
in kpagecount, and tally up the number of pages that are only referenced
|
||||
once.
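The USS tally can be sketched like this. ``unique_set_size`` and ``mapcount_of`` are hypothetical names: ``mapcount_of(pfn)`` stands for whatever code reads the 64-bit count at offset ``pfn * 8`` in ``/proc/kpagecount``. The sketch assumes bit 63 of a pagemap entry means "page present" (from the full bit list in the kernel document) and a 4K page size.

```python
PRESENT = 1 << 63          # assumption: bit 63 of a pagemap entry = present
PFN_MASK = (1 << 55) - 1   # bits 0-54 hold the PFN when the page is present


def unique_set_size(entries, mapcount_of, page_size=4096):
    """Sum the size of present pages whose kpagecount value is exactly 1."""
    unique_pages = 0
    for entry in entries:
        if not entry & PRESENT:
            continue                  # swapped out or never touched
        pfn = entry & PFN_MASK
        if mapcount_of(pfn) == 1:     # referenced by this mapping only
            unique_pages += 1
    return unique_pages * page_size
```

For testing, ``mapcount_of`` can be as simple as the ``get`` method of a dict that maps PFNs to counts.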
|
||||
|
||||
Other notes:
|
||||
Other notes
|
||||
===========
|
||||
|
||||
Reading from any of the files will return -EINVAL if you are not starting
|
||||
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
|
@ -1,34 +1,38 @@
|
||||
SOFT-DIRTY PTEs
|
||||
.. _soft_dirty:
|
||||
|
||||
The soft-dirty is a bit on a PTE which helps to track which pages a task
|
||||
===============
|
||||
Soft-Dirty PTEs
|
||||
===============
|
||||
|
||||
The soft-dirty is a bit on a PTE which helps to track which pages a task
|
||||
writes to. In order to do this tracking one should
|
||||
|
||||
1. Clear soft-dirty bits from the task's PTEs.
|
||||
|
||||
This is done by writing "4" into the /proc/PID/clear_refs file of the
|
||||
This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
|
||||
task in question.
|
||||
|
||||
2. Wait some time.
|
||||
|
||||
3. Read soft-dirty bits from the PTEs.
|
||||
|
||||
This is done by reading from the /proc/PID/pagemap. The bit 55 of the
|
||||
This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
|
||||
64-bit qword is the soft-dirty one. If set, the respective PTE was
|
||||
written to since step 1.
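The three steps can be sketched as below. The helper names are hypothetical. ``clear_soft_dirty`` needs permission to write the target task's ``clear_refs`` file, and ``proc_root`` is a parameter only so that the function can be exercised against a scratch directory.

```python
SOFT_DIRTY_BIT = 55


def clear_soft_dirty(pid, proc_root="/proc"):
    """Step 1: write "4" into the task's clear_refs file."""
    with open(f"{proc_root}/{pid}/clear_refs", "w") as f:
        f.write("4")


def is_soft_dirty(entry):
    """Step 3: test bit 55 of one 64-bit pagemap value."""
    return bool((entry >> SOFT_DIRTY_BIT) & 1)


def dirty_page_indexes(entries):
    """Indexes, within the range that was read, of soft-dirty pages."""
    return [i for i, entry in enumerate(entries) if is_soft_dirty(entry)]
```

The entries passed to ``dirty_page_indexes`` are the u64 values read from ``/proc/PID/pagemap`` for the range of interest, after waiting as in step 2.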
|
||||
|
||||
|
||||
Internally, to do this tracking, the writable bit is cleared from PTEs
|
||||
Internally, to do this tracking, the writable bit is cleared from PTEs
|
||||
when the soft-dirty bit is cleared. So, after this, when the task tries to
|
||||
modify a page at some virtual address the #PF occurs and the kernel sets
|
||||
the soft-dirty bit on the respective PTE.
|
||||
|
||||
Note, that although all the task's address space is marked as r/o after the
|
||||
Note that although all the task's address space is marked as r/o after the
|
||||
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
|
||||
This is so, since the pages are still mapped to physical memory, and thus all
|
||||
the kernel does is find this fact out and put both writable and soft-dirty
|
||||
bits on the PTE.
|
||||
|
||||
While in most cases tracking memory changes by #PF-s is more than enough
|
||||
While in most cases tracking memory changes by #PF-s is more than enough
|
||||
there is still a scenario when we can lose soft dirty bits -- a task
|
||||
unmaps a previously mapped memory region and then maps a new one at exactly
|
||||
the same place. When unmap is called, the kernel internally clears PTE values
|
||||
@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such
|
||||
memory region renewal the kernel always marks new memory regions (and
|
||||
expanded regions) as soft dirty.
|
||||
|
||||
This feature is actively used by the checkpoint-restore project. You
|
||||
This feature is actively used by the checkpoint-restore project. You
|
||||
can find more details about it on http://criu.org
|
||||
|
||||
|
418
Documentation/admin-guide/mm/transhuge.rst
Normal file
@ -0,0 +1,418 @@
|
||||
.. _admin_guide_transhuge:
|
||||
|
||||
============================
|
||||
Transparent Hugepage Support
|
||||
============================
|
||||
|
||||
Objective
|
||||
=========
|
||||
|
||||
Performance critical computing applications dealing with large memory
|
||||
working sets are already running on top of libhugetlbfs and in turn
|
||||
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
|
||||
using huge pages for the backing of virtual memory
|
||||
that supports the automatic promotion and demotion of page sizes and
|
||||
without the shortcomings of hugetlbfs.
|
||||
|
||||
Currently THP only works for anonymous memory mappings and tmpfs/shmem.
|
||||
But in the future it can expand to other filesystems.
|
||||
|
||||
.. note::
|
||||
in the examples below we presume that the basic page size is 4K and
|
||||
the huge page size is 2M, although the actual numbers may vary
|
||||
depending on the CPU architecture.
|
||||
|
||||
Applications run faster because of two
|
||||
factors. The first factor is almost completely irrelevant and it's not
|
||||
of significant interest because it'll also have the downside of
|
||||
requiring larger clear-page copy-page in page faults which is a
|
||||
potentially negative effect. The first factor consists of taking a
|
||||
single page fault for each 2M virtual region touched by userland (so
|
||||
reducing the enter/exit kernel frequency by a factor of 512). This
|
||||
only matters the first time the memory is accessed for the lifetime of
|
||||
a memory mapping. The second long lasting and much more important
|
||||
factor will affect all subsequent accesses to the memory for the whole
|
||||
runtime of the application. The second factor consists of two
|
||||
components:
|
||||
|
||||
1) the TLB miss will run faster (especially with virtualization using
|
||||
nested pagetables but almost always also on bare metal without
|
||||
virtualization)
|
||||
|
||||
2) a single TLB entry will be mapping a much larger amount of virtual
|
||||
memory in turn reducing the number of TLB misses. With
|
||||
virtualization and nested pagetables the TLB can be mapped of
|
||||
larger size only if both KVM and the Linux guest are using
|
||||
hugepages but a significant speedup already happens if only one of
|
||||
the two is using hugepages just because of the fact the TLB miss is
|
||||
going to run faster.
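The numbers behind both factors follow from simple arithmetic. A minimal sketch, using the 4K and 2M sizes presumed in the note above; the 64-entry TLB in the usage note is a made-up figure for illustration only, not a property of any particular CPU.

```python
BASE_PAGE = 4 * 1024          # 4K basic page, as presumed in the note
HUGE_PAGE = 2 * 1024 * 1024   # 2M huge page, as presumed in the note


def faults_to_touch(region_bytes, page_size):
    """Page faults needed to touch every byte of a region once."""
    return -(-region_bytes // page_size)   # ceiling division


def tlb_reach(entries, page_size):
    """Amount of virtual memory covered by `entries` TLB entries."""
    return entries * page_size
```

``HUGE_PAGE // BASE_PAGE`` is 512, the factor quoted for the first-touch page faults. For the second factor, a hypothetical 64-entry TLB covers 256K of address space with 4K pages and 128M with 2M pages.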
|
||||
|
||||
THP can be enabled system wide or restricted to certain tasks or even
|
||||
memory ranges inside a task's address space. Unless THP is completely
|
||||
disabled, there is a ``khugepaged`` daemon that scans memory and
|
||||
collapses sequences of basic pages into huge pages.
|
||||
|
||||
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
|
||||
interface and using madvise(2) and prctl(2) system calls.
|
||||
|
||||
Transparent Hugepage Support maximizes the usefulness of free memory
|
||||
if compared to the reservation approach of hugetlbfs by allowing all
|
||||
unused memory to be used as cache or other movable (or even unmovable)
|
||||
entities. It doesn't require reservation to prevent hugepage
|
||||
allocation failures to be noticeable from userland. It allows paging
|
||||
and all other advanced VM features to be available on the
|
||||
hugepages. It requires no modifications for applications to take
|
||||
advantage of it.
|
||||
|
||||
Applications however can be further optimized to take advantage of
|
||||
this feature, like for example they've been optimized before to avoid
|
||||
a flood of mmap system calls for every malloc(4k). Optimizing userland
|
||||
is by far not mandatory and khugepaged already can take care of long
|
||||
lived page allocations even for hugepage unaware applications that
|
||||
deal with large amounts of memory.
|
||||
|
||||
In certain cases when hugepages are enabled system wide, applications
|
||||
may end up allocating more memory resources. An application may mmap a
|
||||
large region but only touch 1 byte of it, in that case a 2M page might
|
||||
be allocated instead of a 4k page for no good reason. This is why it's
|
||||
possible to disable hugepages system-wide and to only have them inside
|
||||
MADV_HUGEPAGE madvise regions.
|
||||
|
||||
Embedded systems should enable hugepages only inside madvise regions
|
||||
to eliminate any risk of wasting any precious byte of memory and to
|
||||
only run faster.
|
||||
|
||||
Applications that get a lot of benefit from hugepages and that don't
|
||||
risk losing memory by using hugepages should use
|
||||
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
|
||||
|
||||
.. _thp_sysfs:
|
||||
|
||||
sysfs
|
||||
=====
|
||||
|
||||
Global THP controls
|
||||
-------------------
|
||||
|
||||
Transparent Hugepage Support for anonymous memory can be entirely disabled
|
||||
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
|
||||
regions (to avoid the risk of consuming more memory resources) or enabled
|
||||
system wide. This can be achieved with one of::
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/enabled
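To check which setting is currently in effect, the same file can be read back. On the kernels we are aware of it lists every choice and marks the active one with square brackets (for example ``always [madvise] never``); treat that format as an assumption to verify on your system. ``parse_bracketed`` and ``thp_enabled`` are hypothetical helper names.

```python
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_bracketed(text):
    """Return (active, choices) from text like 'always [madvise] never'."""
    active = None
    choices = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        choices.append(token)
    return active, choices


def thp_enabled(path=THP_ENABLED):
    """Read the sysfs file and return the currently active setting."""
    with open(path) as f:
        return parse_bracketed(f.read())[0]
```

The same parser should also work for the ``defrag`` file if it uses the same bracketed layout.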
|
||||
|
||||
It's also possible to limit defrag efforts in the VM to generate
|
||||
anonymous hugepages in case they're not immediately free to madvise
|
||||
regions or to never try to defrag memory and simply fallback to regular
|
||||
pages unless hugepages are immediately available. Clearly if we spend CPU
|
||||
time to defrag memory, we would expect to gain even more by the fact we
|
||||
use hugepages later instead of regular pages. This isn't always
|
||||
guaranteed, but it may be more likely in case the allocation is for a
|
||||
MADV_HUGEPAGE region.
|
||||
|
||||
::
|
||||
|
||||
echo always >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo defer >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
echo never >/sys/kernel/mm/transparent_hugepage/defrag
|
||||
|
||||
always
|
||||
means that an application requesting THP will stall on
|
||||
allocation failure and directly reclaim pages and compact
|
||||
memory in an effort to allocate a THP immediately. This may be
|
||||
desirable for virtual machines that benefit heavily from THP
|
||||
use and are willing to delay the VM start to utilise them.
|
||||
|
||||
defer
|
||||
means that an application will wake kswapd in the background
|
||||
to reclaim pages and wake kcompactd to compact memory so that
|
||||
THP is available in the near future. It's the responsibility
|
||||
of khugepaged to then install the THP pages later.
|
||||
|
||||
defer+madvise
|
||||
will enter direct reclaim and compaction like ``always``, but
|
||||
only for regions that have used madvise(MADV_HUGEPAGE); all
|
||||
other regions will wake kswapd in the background to reclaim
|
||||
pages and wake kcompactd to compact memory so that THP is
|
||||
available in the near future.
|
||||
|
||||
madvise
|
||||
will enter direct reclaim like ``always`` but only for regions
|
||||
that have used madvise(MADV_HUGEPAGE). This is the default
|
||||
behaviour.
|
||||
|
||||
never
|
||||
should be self-explanatory.
|
||||
|
||||
By default the kernel tries to use the huge zero page on a read page fault
|
||||
to an anonymous mapping. It's possible to disable the huge zero page by writing 0
|
||||
or enable it back by writing 1::
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
|
||||
|
||||
Some userspace (such as a test program, or an optimized memory allocation
|
||||
library) may want to know the size (in bytes) of a transparent hugepage::
|
||||
|
||||
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
|
||||
|
||||
khugepaged will be automatically started when
|
||||
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
|
||||
be automatically shut down if it's set to "never".
|
||||
|
||||
Khugepaged controls
|
||||
-------------------
|
||||
|
||||
khugepaged usually runs at low frequency so while one may not want to
|
||||
invoke defrag algorithms synchronously during the page faults, it
|
||||
should be worth invoking defrag at least in khugepaged. However it's
|
||||
also possible to disable defrag in khugepaged by writing 0 or enable
|
||||
defrag in khugepaged by writing 1::
|
||||
|
||||
echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
|
||||
|
||||
You can also control how many pages khugepaged should scan at each
|
||||
pass::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
|
||||
|
||||
and how many milliseconds to wait in khugepaged between each pass (you
|
||||
can set this to 0 to run khugepaged at 100% utilization of one core)::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
|
||||
|
||||
and how many milliseconds to wait in khugepaged if there's a hugepage
|
||||
allocation failure to throttle the next allocation attempt::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
|
||||
|
||||
The khugepaged progress can be seen in the number of pages collapsed::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
|
||||
|
||||
and in the number of full scans completed::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
|
||||
|
||||
``max_ptes_none`` specifies how many extra small pages (that are
|
||||
not already mapped) can be allocated when collapsing a group
|
||||
of small pages into one large page::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
|
||||
|
||||
A higher value makes programs use additional memory.
|
||||
A lower value yields less THP performance gain. The value of
|
||||
max_ptes_none has very little effect on CPU time, so that cost
|
||||
can be ignored.
|
||||
|
||||
``max_ptes_swap`` specifies how many pages can be brought in from
|
||||
swap when collapsing a group of pages into a transparent huge page::
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
|
||||
|
||||
A higher value can cause excessive swap IO and waste
|
||||
memory. A lower value can prevent THPs from being
|
||||
collapsed, resulting in fewer pages being collapsed into
|
||||
THPs, and lower memory access performance.
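All of the khugepaged files named above can be collected in one go. A minimal sketch, assuming each file holds a single integer; ``read_khugepaged`` is a hypothetical helper, and files that are missing or unreadable on a given kernel are reported as ``None``.

```python
import os

KHUGEPAGED_DIR = "/sys/kernel/mm/transparent_hugepage/khugepaged"
KNOBS = (
    "defrag", "pages_to_scan", "scan_sleep_millisecs",
    "alloc_sleep_millisecs", "pages_collapsed", "full_scans",
    "max_ptes_none", "max_ptes_swap",
)


def read_khugepaged(base=KHUGEPAGED_DIR):
    """Return {knob: int or None} for the khugepaged sysfs files."""
    values = {}
    for name in KNOBS:
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            values[name] = None   # missing, unreadable, or not an integer
    return values
```

Sampling ``pages_collapsed`` and ``full_scans`` at two points in time and comparing the results gives a rough view of khugepaged's progress.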
|
||||
|
||||
Boot parameter
|
||||
==============
|
||||
|
||||
You can change the sysfs boot time defaults of Transparent Hugepage
|
||||
Support by passing the parameter ``transparent_hugepage=always`` or
|
||||
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
|
||||
to the kernel command line.
|
||||
|
||||
Hugepages in tmpfs/shmem
|
||||
========================
|
||||
|
||||
You can control hugepage allocation policy in tmpfs with mount option
|
||||
``huge=``. It can have the following values:
|
||||
|
||||
always
|
||||
Attempt to allocate huge pages every time we need a new page;
|
||||
|
||||
never
|
||||
Do not allocate huge pages;
|
||||
|
||||
within_size
|
||||
Only allocate huge page if it will be fully within i_size.
|
||||
Also respect fadvise()/madvise() hints;
|
||||
|
||||
advise
|
||||
Only allocate huge pages if requested with fadvise()/madvise();
|
||||
|
||||
The default policy is ``never``.
|
||||
|
||||
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
|
||||
``huge=never`` will not attempt to break up huge pages at all, just stop more
|
||||
from being allocated.
|
||||
|
||||
There's also a sysfs knob to control hugepage allocation policy for the internal
|
||||
shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
|
||||
is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
|
||||
MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
|
||||
|
||||
In addition to the policies listed above, shmem_enabled allows two further
|
||||
values:
|
||||
|
||||
deny
|
||||
For use in emergencies, to force the huge option off from
|
||||
all mounts;
|
||||
force
|
||||
Force the huge option on for all - very useful for testing;
|
||||
|
||||
Need of application restart
|
||||
===========================
|
||||
|
||||
The transparent_hugepage/enabled values and tmpfs mount option only affect
|
||||
future behavior. So to make them effective you need to restart any
|
||||
application that could have been using hugepages. This also applies to the
|
||||
regions registered in khugepaged.
|
||||
|
||||
Monitoring usage
|
||||
================
|
||||
|
||||
The number of anonymous transparent huge pages currently used by the
|
||||
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
|
||||
To identify what applications are using anonymous transparent huge pages,
|
||||
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
|
||||
for each mapping.
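
As a rough illustration of the per-mapping counting described above, the sketch below sums one field over smaps-formatted text that has already been read into memory. The function name ``sum_smaps_field_kb`` is hypothetical, and the code assumes each field appears at the start of a line as ``Name:   <value> kB``.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sum every "<field>: <n> kB" line found in smaps-formatted text.
 * Returns the total in kB; lines for other fields are ignored.
 * Hypothetical helper, not part of any kernel or libc API.
 */
unsigned long sum_smaps_field_kb(const char *text, const char *field)
{
    unsigned long total = 0;
    size_t flen;

    assert(text != NULL && field != NULL);
    flen = strlen(field);

    while (*text) {
        unsigned long kb;

        /* A match is the field name at line start, followed by ':'. */
        if (strncmp(text, field, flen) == 0 && text[flen] == ':' &&
            sscanf(text + flen + 1, "%lu", &kb) == 1)
            total += kb;

        /* Advance to the start of the next line, if there is one. */
        text = strchr(text, '\n');
        if (!text)
            break;
        text++;
    }
    return total;
}
```

A caller would read ``/proc/PID/smaps`` into a NUL-terminated buffer and pass it with ``"AnonHugePages"``. Because reading smaps is expensive, such a helper is better suited to occasional sampling than to a tight polling loop.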
|
||||
|
||||
The number of file transparent huge pages mapped to userspace is available
|
||||
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
|
||||
To identify what applications are mapping file transparent huge pages, it
|
||||
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
|
||||
for each mapping.
|
||||
|
||||
Note that reading the smaps file is expensive and reading it
|
||||
frequently will incur overhead.
|
||||
|
||||
There are a number of counters in ``/proc/vmstat`` that may be used to
|
||||
monitor how successfully the system is providing huge pages for use.
|
||||
|
||||
thp_fault_alloc
|
||||
is incremented every time a huge page is successfully
|
||||
allocated to handle a page fault. This applies to both the
|
||||
first time a page is faulted and for COW faults.
|
||||
|
||||
thp_collapse_alloc
|
||||
is incremented by khugepaged when it has found
|
||||
a range of pages to collapse into one huge page and has
|
||||
successfully allocated a new huge page to store the data.
|
||||
|
||||
thp_fault_fallback
|
||||
is incremented if a page fault fails to allocate
|
||||
a huge page and instead falls back to using small pages.
|
||||
|
||||
thp_collapse_alloc_failed
|
||||
is incremented if khugepaged found a range
|
||||
of pages that should be collapsed into one huge page but failed
|
||||
the allocation.
|
||||
|
||||
thp_file_alloc
|
||||
is incremented every time a file huge page is successfully
|
||||
allocated.
|
||||
|
||||
thp_file_mapped
|
||||
is incremented every time a file huge page is mapped into
|
||||
user address space.
|
||||
|
||||
thp_split_page
|
||||
is incremented every time a huge page is split into base
|
||||
pages. This can happen for a variety of reasons but a common
|
||||
reason is that a huge page is old and is being reclaimed.
|
||||
This action implies splitting all PMD the page mapped with.
|
||||
|
||||
thp_split_page_failed
|
||||
is incremented if the kernel fails to split a huge
|
||||
page. This can happen if the page was pinned by somebody.
|
||||
|
||||
thp_deferred_split_page
|
||||
is incremented when a huge page is put onto split
|
||||
queue. This happens when a huge page is partially unmapped and
|
||||
splitting it would free up some memory. Pages on split queue are
|
||||
going to be split under memory pressure.
|
||||
|
||||
thp_split_pmd
|
||||
is incremented every time a PMD is split into a table of PTEs.
|
||||
This can happen, for instance, when an application calls mprotect() or
|
||||
munmap() on part of a huge page. It doesn't split the huge page, only
|
||||
the page table entry.
|
||||
|
||||
thp_zero_page_alloc
|
||||
is incremented every time a huge zero page is
|
||||
successfully allocated. It includes allocations which were
|
||||
dropped due to a race with another allocation. Note, it doesn't count
|
||||
every map of the huge zero page, only its allocation.
|
||||
|
||||
thp_zero_page_alloc_failed
|
||||
is incremented if the kernel fails to allocate a
|
||||
huge zero page and falls back to using small pages.
|
||||
|
||||
thp_swpout
|
||||
is incremented every time a huge page is swapped out in one
|
||||
piece without splitting.
|
||||
|
||||
thp_swpout_fallback
|
||||
is incremented if a huge page has to be split before swapout.
|
||||
Usually this is because the kernel failed to allocate contiguous swap space
|
||||
for the huge page.
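
One way to put the fault counters above to use is to compare ``thp_fault_fallback`` against ``thp_fault_alloc``. The sketch below does this over vmstat-formatted text already held in memory. Both function names are hypothetical, and the "fallback ratio" is an illustrative metric, not something the kernel reports itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "<name> <value>" counter in vmstat-formatted text.
 * Returns 0 and stores the value on success, -1 if the counter is absent.
 */
int vmstat_counter(const char *text, const char *name,
                   unsigned long long *out)
{
    size_t nlen;

    assert(text != NULL && name != NULL && out != NULL);
    nlen = strlen(name);

    while (*text) {
        /* Require a space after the name so prefixes do not match. */
        if (strncmp(text, name, nlen) == 0 && text[nlen] == ' ' &&
            sscanf(text + nlen, "%llu", out) == 1)
            return 0;

        text = strchr(text, '\n');
        if (!text)
            break;
        text++;
    }
    return -1;
}

/*
 * Fraction of huge page fault attempts that fell back to small pages.
 * Returns -1.0 if either counter is missing, 0.0 if no attempts were made.
 */
double thp_fault_fallback_ratio(const char *text)
{
    unsigned long long ok, fallback;

    if (vmstat_counter(text, "thp_fault_alloc", &ok) ||
        vmstat_counter(text, "thp_fault_fallback", &fallback))
        return -1.0;
    if (ok + fallback == 0)
        return 0.0;
    return (double)fallback / (double)(ok + fallback);
}
```

The counters are cumulative since boot, so a monitoring tool would normally take two samples of ``/proc/vmstat`` and work with the differences rather than the absolute values.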
|
||||
|
||||
As the system ages, allocating huge pages may be expensive as the
|
||||
system uses memory compaction to copy data around memory to free a
|
||||
huge page for use. There are some counters in ``/proc/vmstat`` to help
|
||||
monitor this overhead.
|
||||
|
||||
compact_stall
|
||||
is incremented every time a process stalls to run
|
||||
memory compaction so that a huge page is free for use.
|
||||
|
||||
compact_success
|
||||
is incremented if the system compacted memory and
|
||||
freed a huge page for use.
|
||||
|
||||
compact_fail
|
||||
is incremented if the system tries to compact memory
|
||||
but failed.
|
||||
|
||||
compact_pages_moved
|
||||
is incremented each time a page is moved. If
|
||||
this value is increasing rapidly, it implies that the system
|
||||
is copying a lot of data to satisfy the huge page allocation.
|
||||
It is possible that the cost of copying exceeds any savings
|
||||
from reduced TLB misses.
|
||||
|
||||
compact_pagemigrate_failed
|
||||
is incremented when the underlying mechanism
|
||||
for moving a page failed.
|
||||
|
||||
compact_blocks_moved
|
||||
is incremented each time memory compaction examines
|
||||
a huge page aligned range of pages.
|
||||
|
||||
It is possible to establish how long the stalls were using the function
|
||||
tracer to record how long was spent in __alloc_pages_nodemask and
|
||||
using the mm_page_alloc tracepoint to identify which allocations were
|
||||
for huge pages.
|
||||
|
||||
Optimizing the applications
|
||||
===========================
|
||||
|
||||
To be guaranteed that the kernel will map a 2M page immediately in any
|
||||
memory region, the mmap region has to be hugepage naturally
|
||||
aligned. posix_memalign() can provide that guarantee.
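
A minimal sketch of such an allocation follows. It assumes a 2M huge page size (as on x86-64), which is hard-coded here as ``HPAGE_SIZE``. The helper name ``alloc_hugepage_aligned`` is hypothetical. The ``madvise(MADV_HUGEPAGE)`` call is only a hint and its result is deliberately ignored.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Assumed huge page size; 2M matches x86-64 but is not universal. */
#define HPAGE_SIZE (2UL * 1024 * 1024)

/*
 * Allocate a buffer that starts on a huge page boundary and whose length
 * is rounded up to a whole number of huge pages. Returns NULL on failure.
 * The caller releases the buffer with free().
 */
void *alloc_hugepage_aligned(size_t len)
{
    void *p = NULL;
    size_t rounded = (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);

    /* rounded is 0 for len == 0 and when the round-up overflowed. */
    if (rounded == 0 || posix_memalign(&p, HPAGE_SIZE, rounded) != 0)
        return NULL;

    /* posix_memalign() guarantees the requested alignment. */
    assert(((uintptr_t)p & (HPAGE_SIZE - 1)) == 0);

#ifdef MADV_HUGEPAGE
    /* A hint only; it matters when the "enabled" knob is set to madvise. */
    (void)madvise(p, rounded, MADV_HUGEPAGE);
#endif
    return p;
}
```

Rounding the length up as well as aligning the start means the region never ends part-way through a huge page, at the cost of up to one huge page of extra virtual address space.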
|
||||
|
||||
Hugetlbfs
|
||||
=========
|
||||
|
||||
You can use hugetlbfs on a kernel that has transparent hugepage
|
||||
support enabled just fine as always. No difference can be noted in
|
||||
hugetlbfs other than there will be less overall fragmentation. All
|
||||
usual features belonging to hugetlbfs are preserved and
|
||||
unaffected. libhugetlbfs will also work fine as usual.
|
@ -1,6 +1,11 @@
|
||||
= Userfaultfd =
|
||||
.. _userfaultfd:
|
||||
|
||||
== Objective ==
|
||||
===========
|
||||
Userfaultfd
|
||||
===========
|
||||
|
||||
Objective
|
||||
=========
|
||||
|
||||
Userfaults allow the implementation of on-demand paging from userland
|
||||
and more generally they allow userland to take control of various
|
||||
@ -9,7 +14,8 @@ memory page faults, something otherwise only the kernel code could do.
|
||||
For example userfaults allows a proper and more optimal implementation
|
||||
of the PROT_NONE+SIGSEGV trick.
|
||||
|
||||
== Design ==
|
||||
Design
|
||||
======
|
||||
|
||||
Userfaults are delivered and resolved through the userfaultfd syscall.
|
||||
|
||||
@ -41,7 +47,8 @@ different processes without them being aware about what is going on
|
||||
themselves on the same region the manager is already tracking, which
|
||||
is a corner case that would currently return -EBUSY).
|
||||
|
||||
== API ==
|
||||
API
|
||||
===
|
||||
|
||||
When first opened the userfaultfd must be enabled invoking the
|
||||
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
|
||||
@ -101,7 +108,8 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
|
||||
half copied page since it'll keep userfaulting until the copy has
|
||||
finished.
|
||||
|
||||
== QEMU/KVM ==
|
||||
QEMU/KVM
|
||||
========
|
||||
|
||||
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
|
||||
migration. Postcopy live migration is one form of memory
|
||||
@ -163,7 +171,8 @@ sending the same page twice (in case the userfault is read by the
|
||||
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
|
||||
thread).
|
||||
|
||||
== Non-cooperative userfaultfd ==
|
||||
Non-cooperative userfaultfd
|
||||
===========================
|
||||
|
||||
When the userfaultfd is monitored by an external manager, the manager
|
||||
must be able to track changes in the process virtual memory
|
||||
@ -172,27 +181,30 @@ the same read(2) protocol as for the page fault notifications. The
|
||||
manager has to explicitly enable these events by setting appropriate
|
||||
bits in uffdio_api.features passed to UFFDIO_API ioctl:
|
||||
|
||||
UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When
|
||||
this feature is enabled, the userfaultfd context of the parent process
|
||||
is duplicated into the newly created process. The manager receives
|
||||
UFFD_EVENT_FORK with file descriptor of the new userfaultfd context in
|
||||
the uffd_msg.fork.
|
||||
UFFD_FEATURE_EVENT_FORK
|
||||
enable userfaultfd hooks for fork(). When this feature is
|
||||
enabled, the userfaultfd context of the parent process is
|
||||
duplicated into the newly created process. The manager
|
||||
receives UFFD_EVENT_FORK with file descriptor of the new
|
||||
userfaultfd context in the uffd_msg.fork.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap()
|
||||
calls. When the non-cooperative process moves a virtual memory area to
|
||||
a different location, the manager will receive UFFD_EVENT_REMAP. The
|
||||
uffd_msg.remap will contain the old and new addresses of the area and
|
||||
its original length.
|
||||
UFFD_FEATURE_EVENT_REMAP
|
||||
enable notifications about mremap() calls. When the
|
||||
non-cooperative process moves a virtual memory area to a
|
||||
different location, the manager will receive
|
||||
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
|
||||
new addresses of the area and its original length.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMOVE - enable notifications about
|
||||
madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event
|
||||
UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The
|
||||
uffd_msg.remove will contain start and end addresses of the removed
|
||||
area.
|
||||
UFFD_FEATURE_EVENT_REMOVE
|
||||
enable notifications about madvise(MADV_REMOVE) and
|
||||
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
|
||||
be generated upon these calls to madvise. The uffd_msg.remove
|
||||
will contain start and end addresses of the removed area.
|
||||
|
||||
UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory
|
||||
unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove
|
||||
containing start and end addresses of the unmapped area.
|
||||
UFFD_FEATURE_EVENT_UNMAP
|
||||
enable notifications about memory unmapping. The manager will
|
||||
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
|
||||
end addresses of the unmapped area.
|
||||
|
||||
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
|
||||
are pretty similar, they quite differ in the action expected from the
|
@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners:
|
||||
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
|
||||
|
||||
B. Use Device Tree bindings, as described in
|
||||
``Documentation/device-tree/bindings/reserved-memory/admin-guide/ramoops.rst``.
|
||||
``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
|
||||
For example::
|
||||
|
||||
reserved-memory {
|
||||
|
@ -302,19 +302,15 @@ Berlin family (Multimedia Solutions)
|
||||
88DE3010, Armada 1000 (no Linux support)
|
||||
Core: Marvell PJ1 (ARMv5TE), Dual-core
|
||||
Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf
|
||||
88DE3005, Armada 1500-mini
|
||||
88DE3005, Armada 1500 Mini
|
||||
Design name: BG2CD
|
||||
Core: ARM Cortex-A9, PL310 L2CC
|
||||
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini/
|
||||
88DE3006, Armada 1500 Mini Plus
|
||||
Design name: BG2CDP
|
||||
Core: Dual Core ARM Cortex-A7
|
||||
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini-plus/
|
||||
88DE3006, Armada 1500 Mini Plus
|
||||
Design name: BG2CDP
|
||||
Core: Dual Core ARM Cortex-A7
|
||||
88DE3100, Armada 1500
|
||||
Design name: BG2
|
||||
Core: Marvell PJ4B-MP (ARMv7), Tauros3 L2CC
|
||||
Product Brief: http://www.marvell.com/digital-entertainment/armada-1500/assets/Marvell-ARMADA-1500-Product-Brief.pdf
|
||||
88DE3114, Armada 1500 Pro
|
||||
Design name: BG2Q
|
||||
Core: Quad Core ARM Cortex-A9, PL310 L2CC
|
||||
@ -324,13 +320,16 @@ Berlin family (Multimedia Solutions)
|
||||
88DE3218, ARMADA 1500 Ultra
|
||||
Core: ARM Cortex-A53
|
||||
|
||||
Homepage: http://www.marvell.com/multimedia-solutions/
|
||||
Homepage: https://www.synaptics.com/products/multimedia-solutions
|
||||
Directory: arch/arm/mach-berlin
|
||||
|
||||
Comments:
|
||||
|
||||
* This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs
|
||||
with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...).
|
||||
|
||||
* The Berlin family was acquired by Synaptics from Marvell in 2017.
|
||||
|
||||
CPU Cores
|
||||
---------
|
||||
|
||||
|
@ -1,7 +1,9 @@
|
||||
Embedded device command line partition parsing
|
||||
=====================================================================
|
||||
|
||||
Support for reading the block device partition table from the command line.
|
||||
The "blkdevparts" command line option adds support for reading the
|
||||
block device partition table from the kernel command line.
|
||||
|
||||
It is typically used for fixed block (eMMC) embedded devices.
|
||||
It has no MBR, so saves storage space. Bootloader can be easily accessed
|
||||
by absolute address of data on the block device.
|
||||
@ -14,22 +16,27 @@ blkdevparts=<blkdev-def>[;<blkdev-def>]
|
||||
<partdef> := <size>[@<offset>](part-name)
|
||||
|
||||
<blkdev-id>
|
||||
block device disk name, embedded device used fixed block device,
|
||||
it's disk name also fixed. such as: mmcblk0, mmcblk1, mmcblk0boot0.
|
||||
block device disk name. Embedded device uses fixed block device.
|
||||
Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.
|
||||
|
||||
<size>
|
||||
partition size, in bytes, such as: 512, 1m, 1G.
|
||||
size may contain an optional suffix of (upper or lower case):
|
||||
K, M, G, T, P, E.
|
||||
"-" is used to denote all remaining space.
|
||||
|
||||
<offset>
|
||||
partition start address, in bytes.
|
||||
offset may contain an optional suffix of (upper or lower case):
|
||||
K, M, G, T, P, E.
|
||||
|
||||
(part-name)
|
||||
partition name, kernel send uevent with "PARTNAME". application can create
|
||||
a link to block device partition with the name "PARTNAME".
|
||||
user space application can access partition by partition name.
|
||||
partition name. Kernel sends uevent with "PARTNAME". Application can
|
||||
create a link to block device partition with the name "PARTNAME".
|
||||
User space application can access partition by partition name.
|
||||
|
||||
Example:
|
||||
eMMC disk name is "mmcblk0" and "mmcblk0boot0"
|
||||
eMMC disk names are "mmcblk0" and "mmcblk0boot0".
|
||||
|
||||
bootargs:
|
||||
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
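
To make the ``<size>`` and ``<offset>`` suffix rule concrete, here is a user-space sketch that converts strings such as ``512``, ``1m`` or ``1G`` to bytes. It is not the kernel's own parser. The name ``parse_part_size`` is hypothetical, and the sketch assumes the suffixes are binary multiples (K = 2^10, M = 2^20, and so on). The special value ``-`` is not handled and would need a separate check by the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse "<number>[K|M|G|T|P|E]" (suffix in either case) into bytes.
 * Returns 0 on success, -1 on malformed input or overflow.
 */
int parse_part_size(const char *s, unsigned long long *bytes)
{
    char *end;
    unsigned long long v;
    unsigned int shift = 0;

    assert(s != NULL && bytes != NULL);

    /* Reject "-", signs and empty strings up front. */
    if (!isdigit((unsigned char)*s))
        return -1;

    v = strtoull(s, &end, 0);

    /* Each case adds 10 bits and falls into the next smaller unit. */
    switch (toupper((unsigned char)*end)) {
    case 'E': shift += 10; /* fall through */
    case 'P': shift += 10; /* fall through */
    case 'T': shift += 10; /* fall through */
    case 'G': shift += 10; /* fall through */
    case 'M': shift += 10; /* fall through */
    case 'K': shift += 10;
        end++;
        break;
    default:
        break;
    }

    /* Anything left after the optional suffix is an error. */
    if (*end != '\0')
        return -1;
    if (shift && v > (~0ULL >> shift))
        return -1;

    *bytes = v << shift;
    return 0;
}
```

With this helper, the ``1G`` and ``1m`` sizes in the bootargs example above come out as 1073741824 and 1048576 bytes respectively.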
|
||||
|
66
Documentation/core-api/gfp_mask-from-fs-io.rst
Normal file
@ -0,0 +1,66 @@
|
||||
=================================
|
||||
GFP masks used from FS/IO context
|
||||
=================================
|
||||
|
||||
:Date: May, 2018
|
||||
:Author: Michal Hocko <mhocko@kernel.org>
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
Code paths in the filesystem and IO stacks must be careful when
|
||||
allocating memory to prevent recursion deadlocks caused by direct
|
||||
memory reclaim calling back into the FS or IO paths and blocking on
|
||||
already held resources (e.g. locks - most commonly those used for the
|
||||
transaction context).
|
||||
|
||||
The traditional way to avoid this deadlock problem is to clear __GFP_FS
|
||||
respectively __GFP_IO (note the latter implies clearing the first as well) in
|
||||
the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
|
||||
used as a shortcut. It turned out though that the above approach has led to
|
||||
abuses when the restricted gfp mask is used "just in case" without a
|
||||
deeper consideration which leads to problems because an excessive use
|
||||
of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
|
||||
reclaim issues.
|
||||
|
||||
New API
|
||||
========
|
||||
|
||||
Since 4.12 we do have a generic scope API for both NOFS and NOIO context
|
||||
``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
|
||||
``memalloc_noio_restore`` which allow a scope to be marked as a critical
|
||||
section from a filesystem or I/O point of view. Any allocation from that
|
||||
scope will inherently drop __GFP_FS respectively __GFP_IO from the given
|
||||
mask so no memory allocation can recurse back in the FS/IO.
|
||||
|
||||
.. kernel-doc:: include/linux/sched/mm.h
|
||||
:functions: memalloc_nofs_save memalloc_nofs_restore
|
||||
.. kernel-doc:: include/linux/sched/mm.h
|
||||
:functions: memalloc_noio_save memalloc_noio_restore
|
||||
|
||||
FS/IO code then simply calls the appropriate save function before
|
||||
any critical section with respect to the reclaim is started - e.g.
|
||||
lock shared with the reclaim context or when a transaction context
|
||||
nesting would be possible via reclaim. The restore function should be
|
||||
called when the critical section ends. All that ideally along with an
|
||||
explanation of what the reclaim context is, for easier maintenance.
|
||||
|
||||
Please note that the proper pairing of save/restore functions
|
||||
allows nesting so it is safe to call ``memalloc_noio_save`` or
|
||||
``memalloc_noio_restore`` respectively from an existing NOIO or NOFS
|
||||
scope.
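
The nesting property can be illustrated with a small user-space model. This is not kernel code: it assumes the scope is tracked as a single flag bit, with save returning the previous state of that bit and restore putting it back. All names here (``model_nofs_save``, ``model_nofs_restore``, ``model_in_nofs_scope``, ``MODEL_NOFS_FLAG``) are hypothetical stand-ins.

```c
#include <assert.h>

/* Stand-in for the per-task flag bit that marks a NOFS scope. */
#define MODEL_NOFS_FLAG 0x1u

/* Stand-in for the flags word of the current task. */
static unsigned int model_task_flags;

/* Enter a NOFS scope; returns the previous state of the flag bit. */
unsigned int model_nofs_save(void)
{
    unsigned int old = model_task_flags & MODEL_NOFS_FLAG;

    model_task_flags |= MODEL_NOFS_FLAG;
    return old;
}

/* Leave a NOFS scope by restoring the state returned by the save call. */
void model_nofs_restore(unsigned int old)
{
    /* Only the flag bit itself may be passed back. */
    assert((old & ~MODEL_NOFS_FLAG) == 0);
    model_task_flags = (model_task_flags & ~MODEL_NOFS_FLAG) | old;
}

/* Nonzero while any NOFS scope is active. */
int model_in_nofs_scope(void)
{
    return (model_task_flags & MODEL_NOFS_FLAG) != 0;
}
```

In this model an inner save sees the bit already set and returns it, so the matching inner restore leaves the outer scope active. Only the outermost restore clears the bit, which is why the save and restore calls must be properly paired.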
|
||||
|
||||
What about __vmalloc(GFP_NOFS)
|
||||
==============================
|
||||
|
||||
vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
|
||||
GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
|
||||
to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
|
||||
almost always a bug. The good news is that the NOFS/NOIO semantic can be
|
||||
achieved by the scope API.
|
||||
|
||||
In the ideal world, upper layers should already mark dangerous contexts
|
||||
and so no special care is required and vmalloc should be called without
|
||||
any problems. Sometimes if the context is not really clear or there are
|
||||
layering violations then the recommended way around that is to wrap ``vmalloc``
|
||||
by the scope API with a comment explaining the problem.
|
@ -14,6 +14,7 @@ Core utilities
|
||||
kernel-api
|
||||
assoc_array
|
||||
atomic_ops
|
||||
cachetlb
|
||||
refcount-vs-atomic
|
||||
cpu_hotplug
|
||||
idr
|
||||
@ -25,6 +26,8 @@ Core utilities
|
||||
genalloc
|
||||
errseq
|
||||
printk-formats
|
||||
circular-buffers
|
||||
gfp_mask-from-fs-io
|
||||
|
||||
Interfaces for kernel debugging
|
||||
===============================
|
||||
|
@ -39,17 +39,17 @@ String Manipulation
|
||||
.. kernel-doc:: lib/string.c
|
||||
:export:
|
||||
|
||||
Basic Kernel Library Functions
|
||||
==============================
|
||||
|
||||
The Linux kernel provides more basic utility functions.
|
||||
|
||||
Bit Operations
|
||||
--------------
|
||||
|
||||
.. kernel-doc:: arch/x86/include/asm/bitops.h
|
||||
:internal:
|
||||
|
||||
Basic Kernel Library Functions
|
||||
==============================
|
||||
|
||||
The Linux kernel provides more basic utility functions.
|
||||
|
||||
Bitmap Operations
|
||||
-----------------
|
||||
|
||||
@ -80,6 +80,31 @@ Command-line Parsing
|
||||
.. kernel-doc:: lib/cmdline.c
|
||||
:export:
|
||||
|
||||
Sorting
|
||||
-------
|
||||
|
||||
.. kernel-doc:: lib/sort.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: lib/list_sort.c
|
||||
:export:
|
||||
|
||||
Text Searching
|
||||
--------------
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:doc: ts_intro
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: include/linux/textsearch.h
|
||||
:functions: textsearch_find textsearch_next \
|
||||
textsearch_get_pattern textsearch_get_pattern_len
|
||||
|
||||
CRC and Math Functions in Linux
|
||||
===============================
|
||||
|
||||
CRC Functions
|
||||
-------------
|
||||
|
||||
@ -103,9 +128,6 @@ CRC Functions
|
||||
.. kernel-doc:: lib/crc-itu-t.c
|
||||
:export:
|
||||
|
||||
Math Functions in Linux
|
||||
=======================
|
||||
|
||||
Base 2 log and power Functions
|
||||
------------------------------
|
||||
|
||||
@ -127,28 +149,6 @@ Division Functions
|
||||
.. kernel-doc:: lib/gcd.c
|
||||
:export:
|
||||
|
||||
Sorting
|
||||
-------
|
||||
|
||||
.. kernel-doc:: lib/sort.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: lib/list_sort.c
|
||||
:export:
|
||||
|
||||
Text Searching
|
||||
--------------
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:doc: ts_intro
|
||||
|
||||
.. kernel-doc:: lib/textsearch.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: include/linux/textsearch.h
|
||||
:functions: textsearch_find textsearch_next \
|
||||
textsearch_get_pattern textsearch_get_pattern_len
|
||||
|
||||
UUID/GUID
|
||||
---------
|
||||
|
||||
|
@ -17,7 +17,7 @@ in order to help maintainers validate their code against the change in
|
||||
these memory ordering guarantees.
|
||||
|
||||
The terms used through this document try to follow the formal LKMM defined in
|
||||
github.com/aparri/memory-model/blob/master/Documentation/explanation.txt
|
||||
tools/memory-model/Documentation/explanation.txt.
|
||||
|
||||
memory-barriers.txt and atomic_t.txt provide more background to the
|
||||
memory ordering in general and for atomic operations specifically.
|
||||
|
@ -20,5 +20,6 @@ for cryptographic use cases, as well as programming examples.
|
||||
architecture
|
||||
devel-algos
|
||||
userspace-if
|
||||
crypto_engine
|
||||
api
|
||||
api-samples
|
||||
|
@ -120,7 +120,7 @@ A typical out of bounds access report looks like this::
|
||||
|
||||
The header of the report describes what kind of bug happened and what kind of
|
||||
access caused it. It's followed by the description of the accessed slub object
|
||||
(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
|
||||
(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and
|
||||
the description of the accessed memory page.
|
||||
|
||||
In the last section the report shows memory state around the accessed address.
|
||||
|
@ -151,6 +151,11 @@ Contributing new tests (details)
|
||||
TEST_FILES, TEST_GEN_FILES mean it is the file which is used by
|
||||
test.
|
||||
|
||||
* First use the headers inside the kernel source and/or git repo, and then the
|
||||
system headers. Headers for the kernel release as opposed to headers
|
||||
installed by the distro on the system should be the primary focus to be able
|
||||
to find regressions.
|
||||
|
||||
Test Harness
|
||||
============
|
||||
|
||||
|
@ -40,4 +40,4 @@ API
|
||||
---
|
||||
|
||||
.. kernel-doc:: drivers/base/devcon.c
|
||||
: functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove
|
||||
:functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove
|
||||
|
@ -44,7 +44,7 @@ common to each controller of that type:
|
||||
|
||||
- methods to establish GPIO line direction
|
||||
- methods used to access GPIO line values
|
||||
- method to set electrical configuration to a a given GPIO line
|
||||
- method to set electrical configuration for a given GPIO line
|
||||
- method to return the IRQ number associated to a given GPIO line
|
||||
- flag saying whether calls to its methods may sleep
|
||||
- optional line names array to identify lines
|
||||
@ -143,7 +143,7 @@ resistor will make the line tend to high level unless one of the transistors on
|
||||
the rail actively pulls it down.
|
||||
|
||||
The level on the line will go as high as the VDD on the pull-up resistor, which
|
||||
may be higher than the level supported by the transistor, achieveing a
|
||||
may be higher than the level supported by the transistor, achieving a
|
||||
level-shift to the higher VDD.
|
||||
|
||||
Integrated electronics often have an output driver stage in the form of a CMOS
|
||||
@ -382,7 +382,7 @@ Real-Time compliance for GPIO IRQ chips
|
||||
|
||||
Any provider of irqchips needs to be carefully tailored to support Real Time
|
||||
preemption. It is desirable that all irqchips in the GPIO subsystem keep this
|
||||
in mind and does the proper testing to assure they are real time-enabled.
|
||||
in mind and do the proper testing to assure they are real time-enabled.
|
||||
So, pay attention on above " RT_FULL:" notes, please.
|
||||
The following is a checklist to follow when preparing a driver for real
|
||||
time-compliance:
|
||||
|
@ -17,7 +17,9 @@ available subsections can be seen below.
|
||||
basics
|
||||
infrastructure
|
||||
pm/index
|
||||
clk
|
||||
device-io
|
||||
device_connection
|
||||
dma-buf
|
||||
device_link
|
||||
message-based
|
||||
|
@ -711,7 +711,8 @@ The vmbus device regions are mapped into uio device resources:
|
||||
|
||||
If a subchannel is created by a request to host, then the uio_hv_generic
|
||||
device driver will create a sysfs binary file for the per-channel ring buffer.
|
||||
For example:
|
||||
For example::
|
||||
|
||||
/sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring
|
||||
|
||||
Further information
|
||||
|
@ -1,7 +1,7 @@
|
||||
#
|
||||
# Feature name: strncasecmp
|
||||
# Kconfig: __HAVE_ARCH_STRNCASECMP
|
||||
# description: arch provides an optimized strncasecmp() function
|
||||
# Feature name: cBPF-JIT
|
||||
# Kconfig: HAVE_CBPF_JIT
|
||||
# description: arch supports cBPF JIT optimizations
|
||||
#
|
||||
-----------------------
|
||||
| arch |status|
|
||||
@ -16,14 +16,16 @@
|
||||
| ia64: | TODO |
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| sparc: | ok |
|
||||
| um: | TODO |
|
||||
| unicore32: | TODO |
|
||||
| x86: | TODO |
|
@ -1,7 +1,7 @@
|
||||
#
|
||||
# Feature name: BPF-JIT
|
||||
# Kconfig: HAVE_BPF_JIT
|
||||
# description: arch supports BPF JIT optimizations
|
||||
# Feature name: eBPF-JIT
|
||||
# Kconfig: HAVE_EBPF_JIT
|
||||
# description: arch supports eBPF JIT optimizations
|
||||
#
|
||||
-----------------------
|
||||
| arch |status|
|
||||
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | ok |
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| openrisc: | ok |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | ok |
|
||||
| nios2: | ok |
|
||||
| openrisc: | ok |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -17,15 +17,17 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| um: | TODO |
|
||||
| unicore32: | TODO |
|
||||
| x86: | ok | 64-bit only
|
||||
| x86: | ok |
|
||||
| xtensa: | ok |
|
||||
-----------------------
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | ok |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | TODO |
|
||||
|
@ -11,16 +11,18 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| h8300: | ok |
|
||||
| hexagon: | ok |
|
||||
| ia64: | TODO |
|
||||
| m68k: | TODO |
|
||||
| microblaze: | ok |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | ok |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -9,7 +9,7 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | ok |
|
||||
| arm: | ok |
|
||||
| arm64: | TODO |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -9,7 +9,7 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | ok |
|
||||
| arm: | ok |
|
||||
| arm64: | TODO |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | ok |
|
||||
| sparc: | TODO |
|
||||
|
@ -9,7 +9,7 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | TODO |
|
||||
| arm: | ok |
|
||||
| arm64: | TODO |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
@ -17,13 +17,15 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| sparc: | ok |
|
||||
| um: | TODO |
|
||||
| unicore32: | TODO |
|
||||
| x86: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,11 +17,13 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| s390: | TODO |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| um: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | ok |
|
||||
|
@ -9,7 +9,7 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | TODO |
|
||||
| arm: | TODO |
|
||||
| arm64: | TODO |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | ok |
|
||||
| mips: | ok |
|
||||
| nds32: | ok |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| openrisc: | ok |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -9,21 +9,23 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | TODO |
|
||||
| arm: | TODO |
|
||||
| arm64: | TODO |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| openrisc: | ok |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| sparc: | ok |
|
||||
| um: | TODO |
|
||||
| unicore32: | TODO |
|
||||
| x86: | ok |
|
||||
|
@ -16,14 +16,16 @@
|
||||
| ia64: | TODO |
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| openrisc: | ok |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| sparc: | ok |
|
||||
| um: | TODO |
|
||||
| unicore32: | TODO |
|
||||
| x86: | ok |
|
||||
|
@ -1,6 +1,6 @@
|
||||
#
|
||||
# Feature name: rwsem-optimized
|
||||
# Kconfig: Optimized asm/rwsem.h
|
||||
# Kconfig: !RWSEM_GENERIC_SPINLOCK
|
||||
# description: arch provides optimized rwsem APIs
|
||||
#
|
||||
-----------------------
|
||||
@ -8,8 +8,8 @@
|
||||
-----------------------
|
||||
| alpha: | ok |
|
||||
| arc: | TODO |
|
||||
| arm: | TODO |
|
||||
| arm64: | TODO |
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
@ -17,14 +17,16 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
| um: | TODO |
|
||||
| um: | ok |
|
||||
| unicore32: | TODO |
|
||||
| x86: | ok |
|
||||
| xtensa: | ok |
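Tables in this layout can be tallied mechanically. The following is a minimal sketch, not part of the commit: the function name `count_status` is hypothetical, and it assumes each architecture row has the form `| <arch>: | <status> |` as written by the refresh script.

```shell
# count_status STATUS: read an arch-support.txt table on stdin and print how
# many architecture rows carry the given status (e.g. ok, TODO, or "..").
# Hypothetical helper for illustration only.
count_status() {
    awk -F'|' -v want="$1" '
        # Architecture rows have an "<arch>:" cell followed by a status cell;
        # the "arch |status" header and the dashed rules do not match.
        NF >= 4 && $2 ~ /:[ \t]*$/ {
            status = $3
            gsub(/[ \t]/, "", status)   # strip padding around the status
            if (status == want) n++
        }
        END { print n + 0 }
    '
}
```

It is meant to be fed a checked-out file rather than a diff view, for example `count_status ok < arch-support.txt`, since a diff lists both the old and the new row for every changed architecture.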
|
||||
|
@ -9,7 +9,7 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | TODO |
|
||||
| arm: | ok |
|
||||
| arm64: | TODO |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | ok |
|
||||
@ -17,13 +17,15 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | ok |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | TODO |
|
||||
| sparc: | ok |
|
||||
| um: | TODO |
|
||||
| unicore32: | TODO |
|
||||
| x86: | ok |
|
||||
|
@ -17,11 +17,13 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| s390: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| um: | TODO |
|
||||
|
@ -17,11 +17,13 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| s390: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| um: | TODO |
|
||||
|
@ -40,10 +40,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -9,7 +9,7 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | .. |
|
||||
| arm: | .. |
|
||||
| arm64: | .. |
|
||||
| arm64: | ok |
|
||||
| c6x: | .. |
|
||||
| h8300: | .. |
|
||||
| hexagon: | .. |
|
||||
@ -17,11 +17,13 @@
|
||||
| m68k: | .. |
|
||||
| microblaze: | .. |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | .. |
|
||||
| openrisc: | .. |
|
||||
| parisc: | .. |
|
||||
| powerpc: | ok |
|
||||
| s390: | .. |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | .. |
|
||||
| sparc: | TODO |
|
||||
| um: | .. |
|
||||
|
98
Documentation/features/scripts/features-refresh.sh
Executable file
@ -0,0 +1,98 @@
|
||||
#
|
||||
# Small script that refreshes the kernel feature support status in place.
|
||||
#
|
||||
|
||||
for F_FILE in Documentation/features/*/*/arch-support.txt; do
|
||||
F=$(grep "^# Kconfig:" "$F_FILE" | cut -c26-)
|
||||
|
||||
#
|
||||
# Each feature F is identified by a pair (O, K), where 'O' can
|
||||
# be either the empty string (for 'nop') or "not" (the logical
|
||||
# negation operator '!'); other operators are not supported.
|
||||
#
|
||||
O=""
|
||||
K=$F
|
||||
if [[ "$F" == !* ]]; then
|
||||
O="not"
|
||||
K=$(echo $F | sed -e 's/^!//g')
|
||||
fi
|
||||
|
||||
#
|
||||
# F := (O, K) is 'valid' iff there is a Kconfig file (for some
|
||||
# arch) which contains K.
|
||||
#
|
||||
# Notice that this definition entails an 'asymmetry' between
|
||||
# the case 'O = ""' and the case 'O = "not"'. E.g., F may be
|
||||
# _invalid_ if:
|
||||
#
|
||||
# [case 'O = ""']
|
||||
# 1) no arch provides support for F,
|
||||
# 2) K does not exist (e.g., it was renamed/mis-typed);
|
||||
#
|
||||
# [case 'O = "not"']
|
||||
# 3) all archs provide support for F,
|
||||
# 4) as in (2).
|
||||
#
|
||||
# The rationale for adopting this definition (and, thus, for
|
||||
# keeping the asymmetry) is:
|
||||
#
|
||||
# We want to be able to 'detect' (2) (or (4)).
|
||||
#
|
||||
# (1) and (3) may further warn the developers about the fact
|
||||
# that K can be removed.
|
||||
#
|
||||
F_VALID="false"
|
||||
for ARCH_DIR in arch/*/; do
|
||||
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
|
||||
K_GREP=$(grep "$K" $K_FILES)
|
||||
if [ ! -z "$K_GREP" ]; then
|
||||
F_VALID="true"
|
||||
break
|
||||
fi
|
||||
done
|
||||
if [ "$F_VALID" = "false" ]; then
|
||||
printf "WARNING: '%s' is not a valid Kconfig\n" "$F"
|
||||
fi
|
||||
|
||||
T_FILE="$F_FILE.tmp"
|
||||
grep "^#" $F_FILE > $T_FILE
|
||||
echo " -----------------------" >> $T_FILE
|
||||
echo " | arch |status|" >> $T_FILE
|
||||
echo " -----------------------" >> $T_FILE
|
||||
for ARCH_DIR in arch/*/; do
|
||||
ARCH=$(echo $ARCH_DIR | sed -e 's/arch//g' | sed -e 's/\///g')
|
||||
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
|
||||
K_GREP=$(grep "$K" $K_FILES)
|
||||
#
|
||||
# Arch support status values for (O, K) are updated according
|
||||
# to the following rules.
|
||||
#
|
||||
# - ("", K) is 'supported by a given arch', if there is a
|
||||
# Kconfig file for that arch which contains K;
|
||||
#
|
||||
# - ("not", K) is 'supported by a given arch', if there is
|
||||
# no Kconfig file for that arch which contains K;
|
||||
#
|
||||
# - otherwise: preserve the previous status value (if any),
|
||||
# default to 'not yet supported'.
|
||||
#
|
||||
# Notice that, according to these rules, invalid features may be
|
||||
# updated/modified.
|
||||
#
|
||||
if [ "$O" = "" ] && [ ! -z "$K_GREP" ]; then
|
||||
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
|
||||
elif [ "$O" = "not" ] && [ -z "$K_GREP" ]; then
|
||||
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
|
||||
else
|
||||
S=$(grep -v "^#" "$F_FILE" | grep " $ARCH:")
|
||||
if [ ! -z "$S" ]; then
|
||||
echo "$S" >> $T_FILE
|
||||
else
|
||||
printf " |%12s: | TODO |\n" "$ARCH" \
|
||||
>> $T_FILE
|
||||
fi
|
||||
fi
|
||||
done
|
||||
echo " -----------------------" >> $T_FILE
|
||||
mv $T_FILE $F_FILE
|
||||
done
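The (O, K) convention and the status rules described in the script's comments can be sketched in isolation. This is an illustration under stated assumptions, not code from the commit: the helper names `parse_feature` and `arch_status` are hypothetical, the word "nop" stands in for the empty string the script uses, and the FOUND and PREV arguments stand in for the grep results the script computes itself.

```shell
# Hypothetical helpers illustrating the rules from features-refresh.sh.

# parse_feature F: print "O K", where O is "not" when F starts with '!'
# and "nop" otherwise (the script itself uses the empty string for 'nop').
parse_feature() {
    case "$1" in
        '!'*) printf 'not %s\n' "${1#?}" ;;   # drop the leading '!'
        *)    printf 'nop %s\n' "$1" ;;
    esac
}

# arch_status O FOUND PREV: print the status cell for one architecture.
#   FOUND is "yes" if some Kconfig file of the arch mentions K, else "no".
#   PREV  is the status previously recorded for the arch (may be empty).
arch_status() {
    if [ "$1" = "nop" ] && [ "$2" = "yes" ]; then
        echo "ok"
    elif [ "$1" = "not" ] && [ "$2" = "no" ]; then
        echo "ok"
    elif [ -n "$3" ]; then
        echo "$3"      # otherwise preserve the previous status value
    else
        echo "TODO"    # no previous value: default to 'not yet supported'
    fi
}
```

For instance, `parse_feature '!RWSEM_GENERIC_SPINLOCK'` prints `not RWSEM_GENERIC_SPINLOCK`, and `arch_status not no TODO` prints `ok`, matching the rule that a negated feature counts as supported when no Kconfig file of the arch mentions K.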
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,12 +17,14 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sh: | ok |
|
||||
| sparc: | TODO |
|
||||
| um: | TODO |
|
||||
| unicore32: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | ok |
|
||||
| microblaze: | ok |
|
||||
| mips: | ok |
|
||||
| nds32: | ok |
|
||||
| nios2: | ok |
|
||||
| openrisc: | ok |
|
||||
| parisc: | TODO |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | .. |
|
||||
| powerpc: | .. |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | .. |
|
||||
| sh: | TODO |
|
||||
| sparc: | .. |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | ok |
|
||||
| mips: | ok |
|
||||
| nds32: | ok |
|
||||
| nios2: | ok |
|
||||
| openrisc: | ok |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | .. |
|
||||
| microblaze: | .. |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | .. |
|
||||
| openrisc: | .. |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | .. |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | .. |
|
||||
| microblaze: | .. |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | .. |
|
||||
| openrisc: | .. |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | ok |
|
||||
| sparc: | TODO |
|
||||
|
@ -9,7 +9,7 @@
|
||||
| alpha: | TODO |
|
||||
| arc: | .. |
|
||||
| arm: | .. |
|
||||
| arm64: | .. |
|
||||
| arm64: | ok |
|
||||
| c6x: | .. |
|
||||
| h8300: | .. |
|
||||
| hexagon: | .. |
|
||||
@ -17,10 +17,12 @@
|
||||
| m68k: | .. |
|
||||
| microblaze: | ok |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | .. |
|
||||
| openrisc: | .. |
|
||||
| parisc: | .. |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -17,10 +17,12 @@
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -515,7 +515,8 @@ guarantees:
|
||||
|
||||
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
|
||||
bits on both physical and virtual pages associated with a process, and the
|
||||
soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
|
||||
soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst
|
||||
for details).
|
||||
To clear the bits for all the pages associated with the process
|
||||
> echo 1 > /proc/PID/clear_refs
|
||||
|
||||
@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect.
|
||||
|
||||
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
|
||||
using /proc/kpageflags and number of times a page is mapped using
|
||||
/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
|
||||
/proc/kpagecount. For detailed explanation, see
|
||||
Documentation/admin-guide/mm/pagemap.rst.
|
||||
|
||||
The /proc/pid/numa_maps is an extension based on maps, showing the memory
|
||||
locality and binding policy, as well as the memory usage (in pages) of
|
||||
@ -564,7 +566,7 @@ address policy mapping details
|
||||
|
||||
Where:
|
||||
"address" is the starting address for the mapping;
|
||||
"policy" reports the NUMA memory policy set for the mapping (see vm/numa_memory_policy.txt);
|
||||
"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst);
|
||||
"mapping details" summarizes mapping data such as mapping type, page usage counters,
|
||||
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
|
||||
size, in KB, that is backing the mapping up.
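The per-node counters in the "mapping details" field lend themselves to simple text processing. A minimal sketch follows, assuming the line layout described above (address, policy, then space-separated details); the function name `sum_node_pages` is hypothetical and not part of the kernel documentation.

```shell
# sum_node_pages NODE: read numa_maps-style lines on stdin and print the
# total of the N<NODE>=<pages> counters across all mappings.
# Hypothetical helper for illustration only.
sum_node_pages() {
    awk -v key="N$1=" '
        {
            # Fields 1 and 2 are the address and the policy; the mapping
            # details start at field 3.
            for (i = 3; i <= NF; i++)
                if (index($i, key) == 1)
                    total += substr($i, length(key) + 1)
        }
        END { print total + 0 }
    '
}
```

On a NUMA-enabled kernel this could be used as `sum_node_pages 0 < /proc/self/numa_maps`; the result is a page count, not a size in KB.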
|
||||
|
@ -105,8 +105,9 @@ policy for the file will revert to "default" policy.
|
||||
NUMA memory allocation policies have optional flags that can be used in
|
||||
conjunction with their modes. These optional flags can be specified
|
||||
when tmpfs is mounted by appending them to the mode before the NodeList.
|
||||
See Documentation/vm/numa_memory_policy.txt for a list of all available
|
||||
memory allocation policy mode flags and their effect on memory policy.
|
||||
See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
|
||||
all available memory allocation policy mode flags and their effect on
|
||||
memory policy.
|
||||
|
||||
=static is equivalent to MPOL_F_STATIC_NODES
|
||||
=relative is equivalent to MPOL_F_RELATIVE_NODES
|
||||
|
@ -45,7 +45,7 @@ the kernel interface as seen by application developers.
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
userspace-api/index
|
||||
userspace-api/index
|
||||
|
||||
|
||||
Introduction to kernel development
|
||||
@ -89,6 +89,7 @@ needed).
|
||||
sound/index
|
||||
crypto/index
|
||||
filesystems/index
|
||||
vm/index
|
||||
|
||||
Architecture-specific documentation
|
||||
-----------------------------------
|
||||
|
@ -73,7 +73,9 @@ will have a second iteration or at least an extension for any given interface.
|
||||
future extensions is going right down the gutters since someone will submit
|
||||
an ioctl struct with random stack garbage in the yet unused parts. Which
|
||||
then bakes in the ABI that those fields can never be used for anything else
|
||||
but garbage.
|
||||
but garbage. This is also the reason why you must explicitly pad all
|
||||
structures, even if you never use them in an array - the padding the compiler
|
||||
might insert could contain garbage.
|
||||
|
||||
* Have simple testcases for all of the above.
|
||||
|
||||
|
@ -2903,7 +2903,7 @@ is discarded from the CPU's cache and reloaded. To deal with this, the
|
||||
appropriate part of the kernel must invalidate the overlapping bits of the
|
||||
cache on each CPU.
|
||||
|
||||
See Documentation/cachetlb.txt for more information on cache management.
|
||||
See Documentation/core-api/cachetlb.rst for more information on cache management.
|
||||
|
||||
|
||||
CACHE COHERENCY VS MMIO
|
||||
@ -3083,7 +3083,7 @@ CIRCULAR BUFFERS
|
||||
Memory barriers can be used to implement circular buffering without the need
|
||||
of a lock to serialise the producer with the consumer. See:
|
||||
|
||||
Documentation/circular-buffers.txt
|
||||
Documentation/core-api/circular-buffers.rst
|
||||
|
||||
for details.
|
||||
|
||||
|
@ -18,17 +18,17 @@ major kernel release happening every two or three months. The recent
|
||||
release history looks like this:
|
||||
|
||||
====== =================
|
||||
2.6.38 March 14, 2011
|
||||
2.6.37 January 4, 2011
|
||||
2.6.36 October 20, 2010
|
||||
2.6.35 August 1, 2010
|
||||
2.6.34 May 15, 2010
|
||||
2.6.33 February 24, 2010
|
||||
4.11 April 30, 2017
|
||||
4.12 July 2, 2017
|
||||
4.13 September 3, 2017
|
||||
4.14 November 12, 2017
|
||||
4.15 January 28, 2018
|
||||
4.16 April 1, 2018
|
||||
====== =================
|
||||
|
||||
Every 2.6.x release is a major kernel release with new features, internal
|
||||
API changes, and more. A typical 2.6 release can contain nearly 10,000
|
||||
changesets with changes to several hundred thousand lines of code. 2.6 is
|
||||
Every 4.x release is a major kernel release with new features, internal
|
||||
API changes, and more. A typical 4.x release contains about 13,000
|
||||
changesets with changes to several hundred thousand lines of code. 4.x is
|
||||
thus the leading edge of Linux kernel development; the kernel uses a
|
||||
rolling development model which is continually integrating major changes.
|
||||
|
||||
@ -70,20 +70,19 @@ will get up to somewhere between -rc6 and -rc9 before the kernel is
|
||||
considered to be sufficiently stable and the final 4.x release is made.
|
||||
At that point the whole process starts over again.
|
||||
|
||||
As an example, here is how the 2.6.38 development cycle went (all dates in
|
||||
2011):
|
||||
As an example, here is how the 4.16 development cycle went (all dates in
|
||||
2018):
|
||||
|
||||
============== ===============================
|
||||
January 4 2.6.37 stable release
|
||||
January 18 2.6.38-rc1, merge window closes
|
||||
January 21 2.6.38-rc2
|
||||
February 1 2.6.38-rc3
|
||||
February 7 2.6.38-rc4
|
||||
February 15 2.6.38-rc5
|
||||
February 21 2.6.38-rc6
|
||||
March 1 2.6.38-rc7
|
||||
March 7 2.6.38-rc8
|
||||
March 14 2.6.38 stable release
|
||||
January 28 4.15 stable release
|
||||
February 11 4.16-rc1, merge window closes
|
||||
February 18 4.16-rc2
|
||||
February 25 4.16-rc3
|
||||
March 4 4.16-rc4
|
||||
March 11 4.16-rc5
|
||||
March 18 4.16-rc6
|
||||
March 25 4.16-rc7
|
||||
April 1 4.16 stable release
|
||||
============== ===============================
|
||||
|
||||
How do the developers decide when to close the development cycle and create
|
||||
@ -99,37 +98,42 @@ release is made. In the real world, this kind of perfection is hard to
|
||||
achieve; there are just too many variables in a project of this size.
|
||||
There comes a point where delaying the final release just makes the problem
|
||||
worse; the pile of changes waiting for the next merge window will grow
|
||||
larger, creating even more regressions the next time around. So most 2.6.x
|
||||
larger, creating even more regressions the next time around. So most 4.x
|
||||
kernels go out with a handful of known regressions though, hopefully, none
|
||||
of them are serious.
|
||||
|
||||
Once a stable release is made, its ongoing maintenance is passed off to the
|
||||
"stable team," currently consisting of Greg Kroah-Hartman. The stable team
|
||||
will release occasional updates to the stable release using the 2.6.x.y
|
||||
will release occasional updates to the stable release using the 4.x.y
|
||||
numbering scheme. To be considered for an update release, a patch must (1)
|
||||
fix a significant bug, and (2) already be merged into the mainline for the
|
||||
next development kernel. Kernels will typically receive stable updates for
|
||||
a little more than one development cycle past their initial release. So,
|
||||
for example, the 2.6.36 kernel's history looked like:
|
||||
for example, the 4.13 kernel's history looked like:
|
||||
|
||||
============== ===============================
|
||||
October 10 2.6.36 stable release
|
||||
November 22 2.6.36.1
|
||||
December 9 2.6.36.2
|
||||
January 7 2.6.36.3
|
||||
February 17 2.6.36.4
|
||||
September 3 4.13 stable release
|
||||
September 13 4.13.1
|
||||
September 20 4.13.2
|
||||
September 27 4.13.3
|
||||
October 5 4.13.4
|
||||
October 12 4.13.5
|
||||
... ...
|
||||
November 24 4.13.16
|
||||
============== ===============================
|
||||
|
||||
2.6.36.4 was the final stable update for the 2.6.36 release.
|
||||
4.13.16 was the final stable update of the 4.13 release.
|
||||
|
||||
Some kernels are designated "long term" kernels; they will receive support
|
||||
for a longer period. As of this writing, the current long term kernels
|
||||
and their maintainers are:
|
||||
|
||||
====== ====================== ===========================
|
||||
2.6.27 Willy Tarreau (Deep-frozen stable kernel)
|
||||
2.6.32 Greg Kroah-Hartman
|
||||
2.6.35 Andi Kleen (Embedded flag kernel)
|
||||
====== ====================== ==============================
|
||||
3.16 Ben Hutchings (very long-term stable kernel)
|
||||
4.1 Sasha Levin
|
||||
4.4 Greg Kroah-Hartman (very long-term stable kernel)
|
||||
4.9 Greg Kroah-Hartman
|
||||
4.14 Greg Kroah-Hartman
|
||||
====== ====================== ===========================
|
||||
|
||||
The selection of a kernel for long-term support is purely a matter of a
|
||||
|
@ -10,8 +10,8 @@ of conventions and procedures which are used in the posting of patches;
|
||||
following them will make life much easier for everybody involved. This
|
||||
document will attempt to cover these expectations in reasonable detail;
|
||||
more information can also be found in the files process/submitting-patches.rst,
|
||||
process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel documentation
|
||||
directory.
|
||||
process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel
|
||||
documentation directory.
|
||||
|
||||
|
||||
When to post
|
||||
@ -198,8 +198,8 @@ pass it to diff with the "-X" option.
|
||||
|
||||
The tags mentioned above are used to describe how various developers have
|
||||
been associated with the development of this patch. They are described in
|
||||
detail in the process/submitting-patches.rst document; what follows here is a brief
|
||||
summary. Each of these lines has the format:
|
||||
detail in the process/submitting-patches.rst document; what follows here is a
|
||||
brief summary. Each of these lines has the format:
|
||||
|
||||
::
|
||||
|
||||
@ -210,8 +210,8 @@ The tags in common use are:
|
||||
- Signed-off-by: this is a developer's certification that he or she has
|
||||
the right to submit the patch for inclusion into the kernel. It is an
|
||||
agreement to the Developer's Certificate of Origin, the full text of
|
||||
which can be found in Documentation/process/submitting-patches.rst. Code without a
|
||||
proper signoff cannot be merged into the mainline.
|
||||
which can be found in Documentation/process/submitting-patches.rst. Code
|
||||
without a proper signoff cannot be merged into the mainline.
|
||||
|
||||
- Co-developed-by: states that the patch was also created by another developer
|
||||
along with the original author. This is useful at times when multiple
|
||||
@ -226,8 +226,8 @@ The tags in common use are:
|
||||
it to work.
|
||||
|
||||
- Reviewed-by: the named developer has reviewed the patch for correctness;
|
||||
see the reviewer's statement in Documentation/process/submitting-patches.rst for more
|
||||
detail.
|
||||
see the reviewer's statement in Documentation/process/submitting-patches.rst
|
||||
for more detail.
|
||||
|
||||
- Reported-by: names a user who reported a problem which is fixed by this
|
||||
patch; this tag is used to give credit to the (often underappreciated)
|
||||
|
@ -52,6 +52,7 @@ lack of a better place.
|
||||
adding-syscalls
|
||||
magic-number
|
||||
volatile-considered-harmful
|
||||
clang-format
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
|
@ -219,7 +219,7 @@ Our goal is to protect your master key by moving it to offline media, so
|
||||
if you only have a combined **[SC]** key, then you should create a separate
|
||||
signing subkey::
|
||||
|
||||
$ gpg --quick-add-key [fpr] ed25519 sign
|
||||
$ gpg --quick-addkey [fpr] ed25519 sign
|
||||
|
||||
Remember to tell the keyservers about this change, so others can pull down
|
||||
your new subkey::
|
||||
@ -450,11 +450,18 @@ functionality. There are several options available:
|
||||
others. If you want to use ECC keys, your best bet among commercially
|
||||
available devices is the Nitrokey Start.
|
||||
|
||||
.. note::
|
||||
|
||||
If you are listed in MAINTAINERS or have an account at kernel.org,
|
||||
you `qualify for a free Nitrokey Start`_ courtesy of The Linux
|
||||
Foundation.
|
||||
|
||||
.. _`Nitrokey Start`: https://shop.nitrokey.com/shop/product/nitrokey-start-6
|
||||
.. _`Nitrokey Pro`: https://shop.nitrokey.com/shop/product/nitrokey-pro-3
|
||||
.. _`Yubikey 4`: https://www.yubico.com/product/yubikey-4-series/
|
||||
.. _Gnuk: http://www.fsij.org/doc-gnuk/
|
||||
.. _`LWN has a good review`: https://lwn.net/Articles/736231/
|
||||
.. _`qualify for a free Nitrokey Start`: https://www.kernel.org/nitrokey-digital-tokens-for-kernel-developers.html
|
||||
|
||||
Configure your smartcard device
|
||||
-------------------------------
|
||||
@ -482,7 +489,7 @@ there are no convenient command-line switches::
|
||||
You should set the user PIN (1), Admin PIN (3), and the Reset Code (4).
|
||||
Please make sure to record and store these in a safe place -- especially
|
||||
the Admin PIN and the Reset Code (which allows you to completely wipe
|
||||
the smartcard). You so rarely need to use the Admin PIN, that you will
|
||||
the smartcard). You so rarely need to use the Admin PIN, that you will
|
||||
inevitably forget what it is if you do not record it.
|
||||
|
||||
Getting back to the main card menu, you can also set other values (such
|
||||
@ -494,6 +501,12 @@ additionally leak information about your smartcard should you lose it.
|
||||
Despite having the name "PIN", neither the user PIN nor the admin
|
||||
PIN on the card need to be numbers.
|
||||
|
||||
.. warning::
|
||||
|
||||
Some devices may require that you move the subkeys onto the device
|
||||
before you can change the passphrase. Please check the documentation
|
||||
provided by the device manufacturer.
|
||||
|
||||
Move the subkeys to your smartcard
|
||||
----------------------------------
|
||||
|
||||
@ -655,6 +668,20 @@ want to import these changes back into your regular working directory::
|
||||
$ gpg --export | gpg --homedir ~/.gnupg --import
|
||||
$ unset GNUPGHOME
|
||||
|
||||
Using gpg-agent over ssh
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You can forward your gpg-agent over ssh if you need to sign tags or
|
||||
commits on a remote system. Please refer to the instructions provided
|
||||
on the GnuPG wiki:
|
||||
|
||||
- `Agent Forwarding over SSH`_
|
||||
|
||||
It works more smoothly if you can modify the sshd server settings on the
|
||||
remote end.
|
||||
|
||||
.. _`Agent Forwarding over SSH`: https://wiki.gnupg.org/AgentForwarding
|
||||
|
||||
|
||||
Using PGP with Git
|
||||
==================
|
||||
@ -692,6 +719,7 @@ should be used (``[fpr]`` is the fingerprint of your key)::
|
||||
tell git to always use it instead of the legacy ``gpg`` from version 1::
|
||||
|
||||
$ git config --global gpg.program gpg2
|
||||
$ git config --global gpgv.program gpgv2
|
||||
|
||||
How to work with signed tags
|
||||
----------------------------
|
||||
@ -731,6 +759,13 @@ If you are verifying someone else's git tag, then you will need to
|
||||
import their PGP key. Please refer to the
|
||||
":ref:`verify_identities`" section below.
|
||||
|
||||
.. note::
|
||||
|
||||
If you get a "``gpg: Can't check signature: unknown pubkey
|
||||
algorithm``" error, you need to tell git to use gpgv2 for
|
||||
verification, so it properly processes signatures made by ECC keys.
|
||||
See instructions at the start of this section.
|
||||
|
||||
Configure git to always sign annotated tags
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
@ -761,7 +761,7 @@ requests, especially from new, unknown developers. If in doubt you can use
|
||||
the pull request as the cover letter for a normal posting of the patch
|
||||
series, giving the maintainer the option of using either.
|
||||
|
||||
A pull request should have [GIT] or [PULL] in the subject line. The
|
||||
A pull request should have [GIT PULL] in the subject line. The
|
||||
request itself should include the repository name and the branch of
|
||||
interest on a single line; it should look something like::
|
||||
|
||||
|
@ -9,5 +9,7 @@ Security Documentation
|
||||
IMA-templates
|
||||
keys/index
|
||||
LSM
|
||||
LSM-sctp
|
||||
SELinux-sctp
|
||||
self-protection
|
||||
tpm/index
|
||||
|
@ -1062,7 +1062,7 @@ output (with ``--no-upload`` option) to kernel bugzilla or alsa-devel
|
||||
ML (see the section `Links and Addresses`_).
|
||||
|
||||
``power_save`` and ``power_save_controller`` options are for power-saving
|
||||
mode. See powersave.txt for details.
|
||||
mode. See powersave.rst for details.
|
||||
|
||||
Note 2: If you get click noises on output, try the module option
|
||||
``position_fix=1`` or ``2``. ``position_fix=1`` will use the SD_LPIB
|
||||
@ -1133,7 +1133,7 @@ line_outs_monitor
|
||||
enable_monitor
|
||||
Enable Analog Out on Channel 63/64 by default.
|
||||
|
||||
See hdspm.txt for details.
|
||||
See hdspm.rst for details.
|
||||
|
||||
Module snd-ice1712
|
||||
------------------
|
||||
|
@ -139,7 +139,7 @@ DAPM description
|
||||
----------------
|
||||
The Dynamic Audio Power Management description describes the codec power
|
||||
components and their relationships and registers to the ASoC core.
|
||||
Please read dapm.txt for details of building the description.
|
||||
Please read dapm.rst for details of building the description.
|
||||
|
||||
Please also see the examples in other codec drivers.
|
||||
|
||||
|
@ -66,7 +66,7 @@ Each SoC DAI driver must provide the following features:-
|
||||
4. SYSCLK configuration
|
||||
5. Suspend and resume (optional)
|
||||
|
||||
Please see codec.txt for a description of items 1 - 4.
|
||||
Please see codec.rst for a description of items 1 - 4.
|
||||
|
||||
|
||||
SoC DSP Drivers
|
||||
|
@ -515,7 +515,7 @@ nr_hugepages
|
||||
|
||||
Change the minimum size of the hugepage pool.
|
||||
|
||||
See Documentation/vm/hugetlbpage.txt
|
||||
See Documentation/admin-guide/mm/hugetlbpage.rst
|
||||
|
||||
==============================================================
|
||||
|
||||
@ -524,7 +524,7 @@ nr_overcommit_hugepages
|
||||
Change the maximum size of the hugepage pool. The maximum is
|
||||
nr_hugepages + nr_overcommit_hugepages.
|
||||
|
||||
See Documentation/vm/hugetlbpage.txt
|
||||
See Documentation/admin-guide/mm/hugetlbpage.rst
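The maximum pool size described above is a simple sum of the two sysctls. A minimal Python sketch, assuming both values are readable as plain integers under ``/proc/sys/vm`` (the helper name ``max_hugepage_pool`` is made up for this example):

```python
from pathlib import Path


def max_hugepage_pool(sysctl_dir="/proc/sys/vm"):
    """Return nr_hugepages + nr_overcommit_hugepages read from sysctl_dir."""
    base = Path(sysctl_dir)
    nr = int((base / "nr_hugepages").read_text().strip())
    overcommit = int((base / "nr_overcommit_hugepages").read_text().strip())
    return nr + overcommit


if __name__ == "__main__":
    print(max_hugepage_pool())
```

The directory is a parameter so the helper can be pointed at a scratch directory for testing.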
|
||||
|
||||
==============================================================
|
||||
|
||||
@ -667,7 +667,7 @@ and don't use much of it.
|
||||
|
||||
The default value is 0.
|
||||
|
||||
See Documentation/vm/overcommit-accounting and
|
||||
See Documentation/vm/overcommit-accounting.rst and
|
||||
mm/mmap.c::__vm_enough_memory() for more information.
|
||||
|
||||
==============================================================
|
||||
|
@ -187,13 +187,19 @@ that can be performed on them (see "struct coresight_ops"). The
|
||||
specific to that component only. "Implementation defined" customisations are
|
||||
expected to be accessed and controlled using those entries.
|
||||
|
||||
Last but not least, "struct module *owner" is expected to be set to reflect
|
||||
the information carried in "THIS_MODULE".
|
||||
|
||||
How to use the tracer modules
|
||||
-----------------------------
|
||||
|
||||
Before trace collection can start, a coresight sink needs to be identify.
|
||||
There are two ways to use the Coresight framework: 1) using the perf cmd line
|
||||
tools and 2) interacting directly with the Coresight devices using the sysFS
|
||||
interface. Preference is given to the former as using the sysFS interface
|
||||
requires a deep understanding of the Coresight HW. The following sections
|
||||
provide details on using both methods.
|
||||
|
||||
1) Using the sysFS interface:
|
||||
|
||||
Before trace collection can start, a coresight sink needs to be identified.
|
||||
There is no limit on the number of sinks (nor sources) that can be enabled at
|
||||
any given moment. As a generic operation, all devices pertaining to the sink
|
||||
class will have an "active" entry in sysfs:
|
||||
@ -298,42 +304,48 @@ Instruction 13570831 0x8026B584 E28DD00C false ADD
|
||||
Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
|
||||
Timestamp Timestamp: 17107041535
|
||||
|
||||
How to use the STM module
|
||||
-------------------------
|
||||
2) Using perf framework:
|
||||
|
||||
Using the System Trace Macrocell module is the same as the tracers - the only
|
||||
difference is that clients are driving the trace capture rather
|
||||
than the program flow through the code.
|
||||
Coresight tracers are represented using the Perf framework's Performance
|
||||
Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of
|
||||
controlling when tracing gets enabled based on when the process of interest is
|
||||
scheduled. When configured in a system, Coresight PMUs will be listed when
|
||||
queried by the perf command line tool:
|
||||
|
||||
As with any other CoreSight component, specifics about the STM tracer can be
|
||||
found in sysfs with more information on each entry being found in [1]:
|
||||
linaro@linaro-nano:~$ ./perf list pmu
|
||||
|
||||
root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
|
||||
enable_source hwevent_select port_enable subsystem uevent
|
||||
hwevent_enable mgmt port_select traceid
|
||||
root@genericarmv8:~#
|
||||
List of pre-defined events (to be used in -e):
|
||||
|
||||
Like any other source a sink needs to be identified and the STM enabled before
|
||||
being used:
|
||||
cs_etm// [Kernel PMU event]
|
||||
|
||||
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
|
||||
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source
|
||||
linaro@linaro-nano:~$
|
||||
|
||||
From there user space applications can request and use channels using the devfs
|
||||
interface provided for that purpose by the generic STM API:
|
||||
Regardless of the number of tracers available in a system (usually equal to the
|
||||
number of processor cores), the "cs_etm" PMU will be listed only once.
|
||||
|
||||
root@genericarmv8:~# ls -l /dev/20100000.stm
|
||||
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
|
||||
root@genericarmv8:~#
|
||||
A Coresight PMU works the same way as any other PMU, i.e. the name of the PMU is
|
||||
listed along with configuration options within forward slashes '/'. Since a
|
||||
Coresight system will typically have more than one sink, the name of the sink to
|
||||
work with needs to be specified as an event option. Names of sinks to choose
|
||||
from are listed in sysFS under ($SYSFS)/bus/coresight/devices:
|
||||
|
||||
Details on how to use the generic STM API can be found here [2].
|
||||
root@linaro-nano:~# ls /sys/bus/coresight/devices/
|
||||
20010000.etf 20040000.funnel 20100000.stm 22040000.etm
|
||||
22140000.etm 230c0000.funnel 23240000.etm 20030000.tpiu
|
||||
20070000.etr 20120000.replicator 220c0000.funnel
|
||||
23040000.etm 23140000.etm 23340000.etm
|
||||
|
||||
[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
|
||||
[2]. Documentation/trace/stm.txt
|
||||
root@linaro-nano:~# perf record -e cs_etm/@20070000.etr/u --per-thread program
|
||||
|
||||
The syntax within the forward slashes '/' is important. The '@' character
|
||||
tells the parser that a sink is about to be specified and that this is the sink
|
||||
to use for the trace session.
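The event syntax above can be assembled programmatically. The following Python sketch builds the same ``perf record`` invocation shown in the transcript; the helper names ``cs_etm_event`` and ``perf_record_argv`` are invented for illustration, and the ``u`` modifier and ``--per-thread`` option are simply copied from the example rather than being required.

```python
import shlex


def cs_etm_event(sink, modifiers="u"):
    """Return the event string: '@' introduces the sink name inside the slashes."""
    return f"cs_etm/@{sink}/{modifiers}"


def perf_record_argv(sink, program_argv, modifiers="u"):
    """Build the argv for a per-thread trace session using the given sink."""
    return ["perf", "record", "-e", cs_etm_event(sink, modifiers),
            "--per-thread", *program_argv]


if __name__ == "__main__":
    # The sink name is one of the entries listed under
    # /sys/bus/coresight/devices on the example board.
    print(shlex.join(perf_record_argv("20070000.etr", ["program"])))
```

Keeping the command as a list avoids shell quoting issues if it is later passed to ``subprocess.run``.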
|
||||
|
||||
Using perf tools
|
||||
----------------
|
||||
More information on the above and other examples of how to use Coresight with
|
||||
the perf tools can be found in the "HOWTO.md" file of the openCSD GitHub
|
||||
repository [3].
|
||||
|
||||
2.1) AutoFDO analysis using the perf tools:
|
||||
|
||||
perf can be used to record and analyze trace of programs.
|
||||
|
||||
@ -381,3 +393,38 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto
|
||||
$ taskset -c 2 ./sort_autofdo
|
||||
Bubble sorting array of 30000 elements
|
||||
5806 ms
|
||||
|
||||
|
||||
How to use the STM module
|
||||
-------------------------
|
||||
|
||||
Using the System Trace Macrocell module is the same as the tracers - the only
|
||||
difference is that clients are driving the trace capture rather
|
||||
than the program flow through the code.
|
||||
|
||||
As with any other CoreSight component, specifics about the STM tracer can be
|
||||
found in sysfs with more information on each entry being found in [1]:
|
||||
|
||||
root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
|
||||
enable_source hwevent_select port_enable subsystem uevent
|
||||
hwevent_enable mgmt port_select traceid
|
||||
root@genericarmv8:~#
|
||||
|
||||
Like any other source a sink needs to be identified and the STM enabled before
|
||||
being used:
|
||||
|
||||
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
|
||||
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source
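The two writes above can also be scripted. This is a minimal Python sketch that assumes the same sysfs layout as the transcript; the function name ``enable_stm`` is hypothetical, the device names are whatever the board exposes, and root privileges would be needed on a real system.

```python
from pathlib import Path

CORESIGHT_DEVICES = Path("/sys/bus/coresight/devices")


def enable_stm(sink, source, devices_dir=CORESIGHT_DEVICES):
    """Enable the sink first, then the STM source, mirroring the echo commands."""
    base = Path(devices_dir)
    (base / sink / "enable_sink").write_text("1")
    (base / source / "enable_source").write_text("1")


if __name__ == "__main__":
    # Device names taken from the example board in the transcript.
    enable_stm("20010000.etf", "20100000.stm")
```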
|
||||
|
||||
From there user space applications can request and use channels using the devfs
|
||||
interface provided for that purpose by the generic STM API:
|
||||
|
||||
root@genericarmv8:~# ls -l /dev/20100000.stm
|
||||
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
|
||||
root@genericarmv8:~#
|
||||
|
||||
Details on how to use the generic STM API can be found here [2].
|
||||
|
||||
[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
|
||||
[2]. Documentation/trace/stm.txt
|
||||
[3]. https://github.com/Linaro/perf-opencsd
|
||||
|
@ -12,7 +12,7 @@ Written for: 4.14
|
||||
Introduction
|
||||
============
|
||||
|
||||
The ftrace infrastructure was originially created to attach callbacks to the
|
||||
The ftrace infrastructure was originally created to attach callbacks to the
|
||||
beginning of functions in order to record and trace the flow of the kernel.
|
||||
But callbacks to the start of a function can have other use cases. Either
|
||||
for live kernel patching, or for security monitoring. This document describes
|
||||
@ -30,7 +30,7 @@ The ftrace context
|
||||
This requires extra care to what can be done inside a callback. A callback
|
||||
can be called outside the protective scope of RCU.
|
||||
|
||||
The ftrace infrastructure has some protections agains recursions and RCU
|
||||
The ftrace infrastructure has some protections against recursions and RCU
|
||||
but one must still be very careful how they use the callbacks.
|
||||
|
||||
|
||||
|
@ -224,6 +224,8 @@ of ftrace. Here is a list of some of the key files:
|
||||
has a side effect of enabling or disabling specific functions
|
||||
to be traced. Echoing names of functions into this file
|
||||
will limit the trace to only those functions.
|
||||
This influences the tracers "function" and "function_graph"
|
||||
and thus also function profiling (see "function_profile_enabled").
|
||||
|
||||
The functions listed in "available_filter_functions" are what
|
||||
can be written into this file.
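As a sketch of the relationship between these two files, the Python helper below checks the requested names against "available_filter_functions" before writing them to "set_ftrace_filter". The tracefs mount point ``/sys/kernel/tracing`` and the helper name are assumptions made for this example; on some systems the files live under ``/sys/kernel/debug/tracing`` instead. It also assumes the kernel accepts several whitespace-separated names in one write, as it does when they are echoed from a shell.

```python
from pathlib import Path


def set_ftrace_filter(names, tracing_dir="/sys/kernel/tracing"):
    """Limit tracing to `names`, rejecting any that ftrace cannot trace."""
    base = Path(tracing_dir)
    # Each line starts with the function name; module functions carry a
    # trailing "[module]" tag, which is ignored here.
    available = {
        line.split()[0]
        for line in (base / "available_filter_functions").read_text().splitlines()
        if line.strip()
    }
    unknown = [name for name in names if name not in available]
    if unknown:
        raise ValueError(f"not traceable: {', '.join(unknown)}")
    # Opening for write truncates the file, replacing any previous filter.
    (base / "set_ftrace_filter").write_text("\n".join(names) + "\n")
```

Validating up front gives a clear error message instead of a failed write partway through the list.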
|
||||
@ -265,6 +267,8 @@ of ftrace. Here is a list of some of the key files:
|
||||
Functions listed in this file will cause the function graph
|
||||
tracer to only trace these functions and the functions that
|
||||
they call. (See the section "dynamic ftrace" for more details).
|
||||
Note, set_ftrace_filter and set_ftrace_notrace still affect
|
||||
what functions are being traced.
|
||||
|
||||
set_graph_notrace:
|
||||
|
||||
@ -277,7 +281,8 @@ of ftrace. Here is a list of some of the key files:
|
||||
|
||||
This lists the functions that ftrace has processed and can trace.
|
||||
These are the function names that you can pass to
|
||||
"set_ftrace_filter" or "set_ftrace_notrace".
|
||||
"set_ftrace_filter", "set_ftrace_notrace",
|
||||
"set_graph_function", or "set_graph_notrace".
|
||||
(See the section "dynamic ftrace" below for more details.)
|
||||
|
||||
dyn_ftrace_total_info:
|
||||
|
@ -2846,7 +2846,7 @@ CPU 의 캐시에서 RAM 으로 쓰여지는 더티 캐시 라인에 의해 덮
|
||||
To resolve the problem, the appropriate part of the kernel must invalidate the
|
||||
problematic bits in each CPU's cache.
|
||||
|
||||
For more information on cache management, see Documentation/cachetlb.txt.
|
||||
For more information on cache management, see Documentation/core-api/cachetlb.rst.
|
||||
|
||||
|
||||
@ -3023,7 +3023,7 @@ smp_mb() 가 아니라 virt_mb() 를 사용해야 합니다.
|
||||
can be used for an implementation that does not use locks for synchronization. For more details,
|
||||
see the following:
|
||||
|
||||
Documentation/circular-buffers.txt
|
||||
Documentation/core-api/circular-buffers.rst
|
||||
|
||||
|
||||
=========
|
||||
|
@ -252,15 +252,14 @@ into VFIO core. When devices are bound and unbound to the driver,
|
||||
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
|
||||
respectively::
|
||||
|
||||
extern int vfio_add_group_dev(struct iommu_group *iommu_group,
|
||||
struct device *dev,
|
||||
extern int vfio_add_group_dev(struct device *dev,
|
||||
const struct vfio_device_ops *ops,
|
||||
void *device_data);
|
||||
|
||||
extern void *vfio_del_group_dev(struct device *dev);
|
||||
|
||||
vfio_add_group_dev() indicates to the core to begin tracking the
|
||||
specified iommu_group and register the specified dev as owned by
|
||||
iommu_group of the specified dev and register the dev as owned by
|
||||
a VFIO bus driver. The driver provides an ops structure for callbacks
|
||||
similar to a file operations structure::
|
||||
|
||||
|
@ -1,62 +1,50 @@
|
||||
00-INDEX
|
||||
- this file.
|
||||
active_mm.txt
|
||||
active_mm.rst
|
||||
- An explanation from Linus about tsk->active_mm vs tsk->mm.
|
||||
balance
|
||||
balance.rst
|
||||
- various information on memory balancing.
|
||||
cleancache.txt
|
||||
cleancache.rst
|
||||
- Intro to cleancache and page-granularity victim cache.
|
||||
frontswap.txt
|
||||
frontswap.rst
|
||||
- Outline frontswap, part of the transcendent memory frontend.
|
||||
highmem.txt
|
||||
highmem.rst
|
||||
- Outline of highmem and common issues.
|
||||
hmm.txt
|
||||
hmm.rst
|
||||
- Documentation of heterogeneous memory management
|
||||
hugetlbpage.txt
|
||||
- a brief summary of hugetlbpage support in the Linux kernel.
|
||||
hugetlbfs_reserv.txt
|
||||
hugetlbfs_reserv.rst
|
||||
- A brief overview of hugetlbfs reservation design/implementation.
|
||||
hwpoison.txt
|
||||
hwpoison.rst
|
||||
- explains what hwpoison is
|
||||
idle_page_tracking.txt
|
||||
- description of the idle page tracking feature.
|
||||
ksm.txt
|
||||
ksm.rst
|
||||
- how to use the Kernel Samepage Merging feature.
|
||||
mmu_notifier.txt
|
||||
mmu_notifier.rst
|
||||
- a note about clearing pte/pmd and mmu notifications
|
||||
numa
|
||||
numa.rst
|
||||
- information about NUMA specific code in the Linux vm.
|
||||
numa_memory_policy.txt
|
||||
- documentation of concepts and APIs of the 2.6 memory policy support.
|
||||
overcommit-accounting
|
||||
overcommit-accounting.rst
|
||||
- description of the Linux kernels overcommit handling modes.
|
||||
page_frags
|
||||
page_frags.rst
|
||||
- description of page fragments allocator
|
||||
page_migration
|
||||
page_migration.rst
|
||||
- description of page migration in NUMA systems.
|
||||
pagemap.txt
|
||||
- pagemap, from the userspace perspective
|
||||
page_owner.txt
|
||||
page_owner.rst
|
||||
- tracking about who allocated each page
|
||||
remap_file_pages.txt
|
||||
remap_file_pages.rst
|
||||
- a note about remap_file_pages() system call
|
||||
slub.txt
|
||||
slub.rst
|
||||
- a short users guide for SLUB.
|
||||
soft-dirty.txt
|
||||
- short explanation for soft-dirty PTEs
|
||||
split_page_table_lock
|
||||
split_page_table_lock.rst
|
||||
- Separate per-table lock to improve scalability of the old page_table_lock.
|
||||
swap_numa.txt
|
||||
swap_numa.rst
|
||||
- automatic binding of swap device to numa node
|
||||
transhuge.txt
|
||||
transhuge.rst
|
||||
- Transparent Hugepage Support, alternative way of using hugepages.
|
||||
unevictable-lru.txt
|
||||
unevictable-lru.rst
|
||||
- Unevictable LRU infrastructure
|
||||
userfaultfd.txt
|
||||
- description of userfaultfd system call
|
||||
z3fold.txt
|
||||
- outline of z3fold allocator for storing compressed pages
|
||||
zsmalloc.txt
|
||||
zsmalloc.rst
|
||||
- outline of zsmalloc allocator for storing compressed pages
|
||||
zswap.txt
|
||||
zswap.rst
|
||||
- Intro to compressed cache for swap pages
|
||||
|
Some files were not shown because too many files have changed in this diff