linux_dsm_epyc7002/Documentation/ABI/testing/sysfs-memory-page-offline
Andi Kleen facb6011f3 HWPOISON: Add soft page offline support
This is a simpler, gentler variant of memory_failure() for soft page
offlining controlled from user space.  It doesn't kill anything, just
tries to invalidate and if that doesn't work migrate the
page away.

This is useful for predictive failure analysis, where a page has
a high rate of corrected errors, but hasn't gone bad yet. Instead
it can be offlined early and avoided.

The offlining is controlled from sysfs, including a new generic
entry point for hard page offlining for symmetry too.

We use the page isolate facility to prevent re-allocation
race. Normally this is only used by memory hotplug. To avoid
races with memory allocation I am using lock_system_sleep().
This avoids the situation where memory hotplug is about
to isolate a page range and then hwpoison undoes that work.
This is a big hammer currently, but the simplest solution
currently.

When the page is not free or LRU we try to free pages
from slab and other caches. The slab freeing is currently
quite dumb and does not try to focus on the specific slab
cache which might own the page. This could be potentially
improved later.

Thanks to Fengguang Wu and Haicheng Li for some fixes.

[Added fix from Andrew Morton to adapt to new migrate_pages prototype]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-16 12:20:00 +01:00

45 lines
1.6 KiB
Plaintext

What: /sys/devices/system/memory/soft_offline_page
Date: Sep 2009
KernelVersion: 2.6.33
Contact: andi@firstfloor.org
Description:
Soft-offline the memory page containing the physical address
written into this file. Input is a hex number specifying the
physical address of the page. The kernel will then attempt
to soft-offline it, by moving the contents elsewhere or
dropping it if possible. The kernel will then be placed
on the bad page list and never be reused.
The offlining is done in kernel specific granuality.
Normally it's the base page size of the kernel, but
this might change.
The page must be still accessible, not poisoned. The
kernel will never kill anything for this, but rather
fail the offline. Return value is the size of the
number, or a error when the offlining failed. Reading
the file is not allowed.
What: /sys/devices/system/memory/hard_offline_page
Date: Sep 2009
KernelVersion: 2.6.33
Contact: andi@firstfloor.org
Description:
Hard-offline the memory page containing the physical
address written into this file. Input is a hex number
specifying the physical address of the page. The
kernel will then attempt to hard-offline the page, by
trying to drop the page or killing any owner or
triggering IO errors if needed. Note this may kill
any processes owning the page. The kernel will avoid
to access this page assuming it's poisoned by the
hardware.
The offlining is done in kernel specific granuality.
Normally it's the base page size of the kernel, but
this might change.
Return value is the size of the number, or a error when
the offlining failed.
Reading the file is not allowed.