mirror of
https://github.com/AuxXxilium/linux_dsm_epyc7002.git
synced 2025-01-24 19:59:25 +07:00
0b405a0f7e
The driver model has a "detach_state" mechanism that:

 - Has never been used by any in-kernel driver;
 - Is superfluous, since driver remove() methods can do the same thing;
 - Became buggy when the suspend() parameter changed semantics and type;
 - Could self-deadlock when called from certain suspend contexts;
 - Is effectively wasted documentation, object code, and headspace.

This removes that "detach_state" mechanism; net code shrink, as well as
a per-device saving in the driver model and sysfs.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
299 lines
12 KiB
Plaintext
Device Power Management


Device power management encompasses two areas - the ability to save
state and transition a device to a low-power state when the system is
entering a low-power state; and the ability to transition a device to
a low-power state while the system is running (and independently of
any other power management activity).


Methods

The methods to suspend and resume devices reside in struct bus_type:

struct bus_type {
       ...
       int             (*suspend)(struct device * dev, pm_message_t state);
       int             (*resume)(struct device * dev);
};

Each bus driver is responsible for implementing these methods,
translating the call into a bus-specific request and forwarding the
call to the bus-specific drivers. For example, PCI drivers implement
suspend() and resume() methods in struct pci_driver. The PCI core is
simply responsible for translating the pointers to PCI-specific ones
and calling the low-level driver.

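The forwarding described above can be sketched in user space as
follows. This is only an illustration: the types are simplified
stand-ins rather than the kernel definitions, and struct demo_driver,
demo_bus and the demo_* functions are hypothetical names playing the
roles of struct pci_driver and the PCI core.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
typedef struct {
	int event;
	int flags;
} pm_message_t;

struct device;

struct bus_type {
	const char *name;
	int (*suspend)(struct device *dev, pm_message_t state);
	int (*resume)(struct device *dev);
};

struct device {
	struct bus_type *bus;
	void *driver;		/* bus-specific driver, opaque to the core */
	int suspended;
};

/* Hypothetical bus-specific driver, in the role of struct pci_driver. */
struct demo_driver {
	int (*suspend)(struct device *dev, pm_message_t state);
	int (*resume)(struct device *dev);
};

/* Bus-level methods: translate the generic call and forward it. */
static int demo_bus_suspend(struct device *dev, pm_message_t state)
{
	struct demo_driver *drv = dev->driver;

	/* A bus may supply a default when the driver has no method. */
	if (!drv || !drv->suspend)
		return 0;
	return drv->suspend(dev, state);
}

static int demo_bus_resume(struct device *dev)
{
	struct demo_driver *drv = dev->driver;

	if (!drv || !drv->resume)
		return 0;
	return drv->resume(dev);
}

static struct bus_type demo_bus = {
	.name = "demo",
	.suspend = demo_bus_suspend,
	.resume = demo_bus_resume,
};

/* Low-level driver methods; they only flip a flag here. */
static int demo_drv_suspend(struct device *dev, pm_message_t state)
{
	(void)state;
	dev->suspended = 1;
	return 0;
}

static int demo_drv_resume(struct device *dev)
{
	dev->suspended = 0;
	return 0;
}

static struct demo_driver demo_drv = {
	.suspend = demo_drv_suspend,
	.resume = demo_drv_resume,
};
```

A caller only ever goes through dev->bus->suspend() and
dev->bus->resume(); the bus decides how to reach the driver.
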
This is done to a) ease transition to the new power management methods
and leverage the existing PM code in various bus drivers; b) allow
buses to implement generic and default PM routines for devices; and c)
make the flow of execution obvious to the reader.


System Power Management

When the system enters a low-power state, the device tree is walked in
a depth-first fashion to transition each device into a low-power
state. The ordering of the device tree is guaranteed by the order in
which devices get registered - children are never registered before
their ancestors, and devices are placed at the back of the list when
registered. By walking the list in reverse order, we are guaranteed to
suspend devices in the proper order.

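A minimal user-space sketch of that ordering argument, assuming a
fixed-size array in place of the real device list. The names
dev_list, device_register(), suspend_all() and suspend_log are made
up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEVICES 8

struct device {
	const char *name;
	int suspended;
};

/* Registration order: a parent is always registered before its children. */
static struct device *dev_list[MAX_DEVICES];
static size_t dev_count;

static int device_register(struct device *dev)
{
	if (dev_count == MAX_DEVICES)
		return -1;
	dev_list[dev_count++] = dev;	/* new devices go to the back */
	return 0;
}

/* Order in which devices were suspended, recorded for inspection. */
static const char *suspend_log[MAX_DEVICES];
static size_t suspend_log_len;

/* Walk the list back to front, so children are suspended before parents. */
static void suspend_all(void)
{
	size_t i;

	suspend_log_len = 0;
	for (i = dev_count; i-- > 0; ) {
		dev_list[i]->suspended = 1;
		suspend_log[suspend_log_len++] = dev_list[i]->name;
	}
}
```

Registering a bridge and then the devices behind it, and calling
suspend_all(), leaves the bridge as the last entry in suspend_log.
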
Devices are suspended once with interrupts enabled. Drivers are
expected to stop I/O transactions, save device state, and place the
device into a low-power state. Drivers may sleep, allocate memory,
etc. at will.

Some devices are broken and will inevitably have problems powering
down or disabling themselves with interrupts enabled. In these
special cases, their drivers may return -EAGAIN. This will put the
device on a list to be taken care of later. When interrupts are
disabled, before we enter the low-power state, their drivers are
called again to put their devices to sleep.

On resume, the devices that returned -EAGAIN will be called to power
themselves back on with interrupts disabled. Once interrupts have been
re-enabled, the rest of the drivers will be called to resume their
devices. On resume, a driver is responsible for powering back on each
device, restoring state, and re-enabling I/O transactions for that
device.

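The two-pass suspend and the matching resume order can be sketched as
below. This is a user-space model under stated assumptions: the
irqs_enabled variable stands in for real interrupt masking, and the
deferred list, demo_suspend(), demo_resume(), suspend_devices() and
resume_devices() are hypothetical names, not the PM core's.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_DEVICES 8

/* Stand-in for the CPU interrupt state; the kernel really disables IRQs. */
static int irqs_enabled = 1;

struct device {
	int needs_irqs_off;		/* models a "broken" device */
	int suspended;
	int suspended_irqs_off;		/* recorded for inspection only */
	int resumed_irqs_off;
};

static struct device *deferred[MAX_DEVICES];
static size_t n_deferred;

/* Hypothetical driver method: asks to be called again with IRQs off. */
static int demo_suspend(struct device *dev)
{
	if (dev->needs_irqs_off && irqs_enabled)
		return -EAGAIN;
	dev->suspended = 1;
	dev->suspended_irqs_off = !irqs_enabled;
	return 0;
}

static void demo_resume(struct device *dev)
{
	dev->suspended = 0;
	dev->resumed_irqs_off = !irqs_enabled;
}

static int suspend_devices(struct device **devs, size_t n)
{
	size_t i;
	int ret;

	assert(n <= MAX_DEVICES);
	n_deferred = 0;

	/* First pass: interrupts enabled, list walked in reverse. */
	for (i = n; i-- > 0; ) {
		ret = demo_suspend(devs[i]);
		if (ret == -EAGAIN)
			deferred[n_deferred++] = devs[i];
		else if (ret)
			return ret;
	}

	/* Second pass: interrupts disabled, deferred devices only. */
	irqs_enabled = 0;
	for (i = 0; i < n_deferred; i++) {
		ret = demo_suspend(deferred[i]);
		if (ret)
			return ret;
	}
	return 0;
}

static void resume_devices(struct device **devs, size_t n)
{
	size_t i;

	/* Deferred devices are powered back on while IRQs are still off. */
	for (i = n_deferred; i-- > 0; )
		demo_resume(deferred[i]);

	/* Everything else resumes once interrupts are back. */
	irqs_enabled = 1;
	for (i = 0; i < n; i++)
		if (devs[i]->suspended)
			demo_resume(devs[i]);
}
```

Error unwinding (resuming devices that were already suspended when a
later one fails) is left out to keep the sketch short.
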
System devices follow a slightly different API, which can be found in

	include/linux/sysdev.h
	drivers/base/sys.c

System devices will only be suspended with interrupts disabled, and
after all other devices have been suspended. On resume, they will be
resumed before any other devices, and also with interrupts disabled.


Runtime Power Management

Many devices are able to dynamically power down while the system is
still running. This feature is useful for devices that are not being
used, and can offer significant power savings on a running system.

In each device's directory, there is a 'power' directory, which
contains at least a 'state' file. Reading from this file displays what
power state the device is currently in. Writing to this file initiates
a transition to the specified power state, which must be a decimal in
the range 1-3, inclusive; or 0 for 'On'.

The PM core will call the ->suspend() method in the bus_type object
that the device belongs to if the specified state is not 0, or
->resume() if it is.

Nothing will happen if the specified state is the same state the
device is currently in.

If the device is already in a low-power state, and the specified state
is another, but different, low-power state, the ->resume() method will
first be called to power the device back on, then ->suspend() will be
called again with the new state.

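The rules above amount to a small decision procedure. The sketch below
models it in user space; set_power_state(), bus_suspend() and
bus_resume() are hypothetical names, the state is passed as a plain
int rather than a pm_message_t, and returning -EINVAL for values
outside 0-3 is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>

struct device {
	int power_state;	/* 0 = on, 1-3 = low-power states */
	int suspend_calls;	/* counters for inspection only */
	int resume_calls;
};

/* Hypothetical bus methods; here they only count how often they run. */
static int bus_suspend(struct device *dev, int state)
{
	(void)state;
	dev->suspend_calls++;
	return 0;
}

static int bus_resume(struct device *dev)
{
	dev->resume_calls++;
	return 0;
}

static int set_power_state(struct device *dev, int state)
{
	int ret;

	if (state < 0 || state > 3)
		return -EINVAL;
	if (state == dev->power_state)
		return 0;		/* same state: nothing happens */

	if (dev->power_state != 0) {
		/* Leaving a low-power state always goes through resume. */
		ret = bus_resume(dev);
		if (ret)
			return ret;
		dev->power_state = 0;
	}

	if (state != 0) {
		ret = bus_suspend(dev, state);
		if (ret)
			return ret;
		dev->power_state = state;
	}
	return 0;
}
```

Going from state 3 to state 1 therefore costs one resume call and one
suspend call, while writing the current state again costs nothing.
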
The driver is responsible for saving the working state of the device
and putting it into the low-power state specified. If this was
successful, it returns 0, and the device's power_state field is
updated.

The driver must take care to know whether or not it is able to
properly resume the device, including all steps of reinitialization
necessary. (This is the hardest part, and the one most protected by
NDA'd documents).

The driver must also take care not to suspend a device that is
currently in use. It is the driver's responsibility to provide its own
exclusion mechanisms.

The runtime power transition happens with interrupts enabled. If a
device cannot support being powered down with interrupts enabled, it
may return -EAGAIN (as it would during a system power management
transition), but it will _not_ be called again, and the transition
will fail.

There is currently no way to know what states a device or driver
supports a priori. This will change in the future.

pm_message_t meaning

pm_message_t has two fields: event ("major") and flags. If a driver
does not know an event code, it aborts the request and returns an
error. Some drivers may need to deal with special cases based on the
actual type of suspend operation being done at the system level. This
is why there are flags.

Event codes are:

ON -- no need to do anything except special cases like broken
HW.

# NOTIFICATION -- pretty much same as ON?

FREEZE -- stop DMA and interrupts, and be prepared to reinit HW from
scratch. That probably means stop accepting upstream requests, the
actual policy of what to do with them being specific to a given
driver. It's acceptable for a network driver to just drop packets
while a block driver is expected to block the queue so no request is
lost. (Use IDE as an example on how to do that). FREEZE requires no
power state change, and it's expected for drivers to be able to
quickly transition back to operating state.

SUSPEND -- like FREEZE, but also put hardware into low-power state. If
there's need to distinguish several levels of sleep, an additional
flag is probably the best way to do that.

Transitions are only from a resumed state to a suspended state, never
between 2 suspended states. (ON -> FREEZE or ON -> SUSPEND can happen,
FREEZE -> SUSPEND or SUSPEND -> FREEZE cannot).

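A driver's suspend method that follows these rules might look roughly
like the sketch below. The EVENT_* constants, the two-field
pm_message_t layout and the demo_* functions are assumptions made for
this example and do not come from a kernel header; the choice of
-EINVAL and -EBUSY as error codes is likewise only illustrative.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical event codes and message layout, following the text above. */
enum { EVENT_ON, EVENT_FREEZE, EVENT_SUSPEND };

typedef struct {
	int event;	/* "major" code: what the driver acts on */
	int flags;	/* system-level detail; ignored by this driver */
} pm_message_t;

struct device {
	int event;	/* last event applied; EVENT_ON while running */
	int io_stopped;
	int low_power;
};

static int demo_suspend(struct device *dev, pm_message_t state)
{
	switch (state.event) {
	case EVENT_ON:
		return 0;		/* nothing to do */
	case EVENT_FREEZE:
	case EVENT_SUSPEND:
		break;
	default:
		return -EINVAL;		/* unknown event: abort the request */
	}

	/* Only resumed -> suspended transitions are expected. */
	if (dev->event != EVENT_ON)
		return -EBUSY;

	dev->io_stopped = 1;		/* stop DMA and interrupts */
	dev->low_power = (state.event == EVENT_SUSPEND);
	dev->event = state.event;
	return 0;
}

static int demo_resume(struct device *dev)
{
	/* Reinitialize from scratch, whatever the previous event was. */
	dev->io_stopped = 0;
	dev->low_power = 0;
	dev->event = EVENT_ON;
	return 0;
}
```

The flags field is deliberately never read: the driver decides purely
on the event code.
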
All events are:

[NOTE NOTE NOTE: If you are a driver author, you should not care; you
should only look at event, and ignore flags.]

#Prepare for suspend -- userland is still running but we are going to
#enter suspend state. This gives drivers chance to load firmware from
#disk and store it in memory, or do other activities that require
#operating userland, ability to kmalloc GFP_KERNEL, etc... All of these
#are forbidden once the suspend dance is started. event = ON, flags =
#PREPARE_TO_SUSPEND

Apm standby -- prepare for APM event. Quiesce devices to make life
easier for APM BIOS. event = FREEZE, flags = APM_STANDBY

Apm suspend -- same as APM_STANDBY, but we should probably avoid
spinning down disks. event = FREEZE, flags = APM_SUSPEND

System halt, reboot -- quiesce devices to make life easier for BIOS. event
= FREEZE, flags = SYSTEM_HALT or SYSTEM_REBOOT

System shutdown -- at least disks need to be spun down, or data may be
lost. Quiesce devices, just to make life easier for BIOS. event =
FREEZE, flags = SYSTEM_SHUTDOWN

Kexec -- turn off DMAs and put hardware into some state where new
kernel can take over. event = FREEZE, flags = KEXEC

Powerdown at end of swsusp -- very similar to SYSTEM_SHUTDOWN, except wake
may need to be enabled on some devices. This actually has at least 3
subtypes: the system can reboot, enter S4, or enter S5 at the end of
swsusp. event = FREEZE, flags = SWSUSP and one of SYSTEM_REBOOT,
SYSTEM_SHUTDOWN, SYSTEM_S4

Suspend to ram -- put devices into low power state. event = SUSPEND,
flags = SUSPEND_TO_RAM

Freeze for swsusp snapshot -- stop DMA and interrupts. No need to put
devices into low power mode, but you must be able to reinitialize
device from scratch in resume method. This has two flavors: it's done
once on the suspending kernel, once on the resuming kernel. event =
FREEZE, flags = DURING_SUSPEND or DURING_RESUME

Device detach requested from /sys -- deinitialize device; probably same as
SYSTEM_SHUTDOWN, I do not understand this one too much. Probably event
= FREEZE, flags = DEV_DETACH.

#These are not really events sent:
#
#System fully on -- device is working normally; this is probably never
#passed to suspend() method... event = ON, flags = 0
#
#Ready after resume -- userland is now running, again. Time to free any
#memory you ate during prepare to suspend... event = ON, flags =
#READY_AFTER_RESUME
#
