Now that the kernel provides DIV_ROUND_CLOSEST_ULL(), drop the internal
implementation and use the kernel one.
Signed-off-by: Javi Merino <javi.merino@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the ladder governor sees the CPUIDLE_FLAG_TIME_INVALID flag,
it unconditionally causes a state promotion by setting last_residency
to a number higher than the state's promotion_time:
last_residency = last_state->threshold.promotion_time + 1
It does this for fear that cpuidle_get_last_residency()
will be in-accurate, because cpuidle_enter_state() invoked
a state with CPUIDLE_FLAG_TIME_INVALID.
But the only state with CPUIDLE_FLAG_TIME_INVALID is
acpi_safe_halt(), which may return well after its actual
idle duration because it enables interrupts, so cpuidle_enter_state()
also measures interrupt service time.
So what? In ladder, a huge invalid last_residency has exactly
the same effect as the current code -- it unconditionally
causes a state promotion.
In the case where the idle residency plus measured interrupt
handling time is less than the state's demotion_time -- we should
use that timestamp to give ladder a chance to demote, rather than
unconditionally promoting.
This can be done by simply ignoring the CPUIDLE_FLAG_TIME_INVALID,
and using the "invalid" time, as it is either equal to what we are
doing today, or better.
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When menu sees CPUIDLE_FLAG_TIME_INVALID, it ignores its timestamps,
and assumes that idle lasted as long as the time till next predicted
timer expiration.
But if an interrupt was seen and serviced before that duration,
it would actually be more accurate to use the measured time
rather than rounding up to the next predicted timer expiration.
And if an interrupt is seen and serviced such that the mesured time
exceeds the time till next predicted timer expiration, then
truncating to that expiration is the right thing to do --
since we can never stay idle past that timer expiration.
So the code can do a better job without
checking for CPUIDLE_FLAG_TIME_INVALID.
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The only place where the time is invalid is when the ACPI_CSTATE_FFH entry
method is not set. Otherwise for all the drivers, the time can be correctly
measured.
Instead of duplicating the CPUIDLE_FLAG_TIME_VALID flag in all the drivers
for all the states, just invert the logic by replacing it by the flag
CPUIDLE_FLAG_TIME_INVALID, hence we can set this flag only for the acpi idle
driver, remove the former flag from all the drivers and invert the logic with
this flag in the different governor.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
All of these are for address calculation. Replace with
this_cpu_ptr().
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
[cpufreq changes]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Fix for an ACPI-based device hotplug regression introduced in 3.14
that causes a kernel panic to trigger when memory hot-remove is
attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen.
- Fix for a cpufreq regression introduced in 3.16 that triggers a
"sleeping function called from invalid context" bug in
dev_pm_opp_init_cpufreq_table() from Stephen Boyd.
- ACPI battery driver fix for a warning message added in 3.16 that
prints silly stuff sometimes from Mariusz Ceier.
- Hibernation fix for safer handling of mismatches in the 820 memory
map between the configurations during image creation and during
the subsequent restore from Chun-Yi Lee.
- ACPI processor driver fix to handle CPU hotplug notifications
correctly during system suspend/resume from Lan Tianyu.
- Series of four cpuidle menu governor cleanups that also should
speed it up a bit from Mel Gorman.
- Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little
cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus Pargmann
and Uwe Kleine-König.
- Version 3.0 of the analyze_suspend.py suspend profiling tool
from Todd E Brandt.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJT7UnNAAoJEILEb/54YlRxcxIP/ROFeak3+5tt3hkvZCevxpUh
AMPccgUoqsF2dognO3pcR4AgGP+meM6Qw0zBjPDNx6oo87hw7P1HlngfaRPHnWPh
iAkY2p1QhGAZW29vqxqBIdLVP9M+Nje0tvOX8/6QEsQgo2y6YCbJU0zITmvb8Tsk
183cXiz6xXDezt4sPeIVg2QVfngVFtOeNVgHDIhldQSF6zUQJP/3+BVutvaj3olt
2O3qpNfwJjFh9p6LWQ+CAalq3hJyNZ6ettLNCvudeq4kqRo49WAdjHaRW+qju/NR
dWybO29MfviczABVQ1ReqSnz0MJOqhZNxkEi5KqnYBb3fx8e2XffsBFzFzTp6BJi
bp4ALcFIu9r5ctWVxQhmgEC6uhYMIXZ681sH99HyIdzk2cNRgMxRj6u2aVe/Cczu
Bb489CRHmOrZyXrkmENg+LkOYBNoXcT+RepH9Ex8R+TNBlKLEBKMMgPrfbFeVKWB
Vm621tHNATJG8nJcs3zJulM2FQ0q8c2irw6WwhUxzbSOxmqSvO5zN3OgYt+c+gWk
MmA8IhUpQBLkqBx1FMi0lOOdIW3qKZJFrU39VQEjoP4P1nXgf373NPlfgzMvEvqM
qQ8srMKFUjYxH3g0ftWk5a2MwEjyHQpvZe0djsMCN7ZkFLwUe1ri/R9Ja2LLQcIZ
SyVkFbbO+moXTRMA1yA9
=kpiw
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
"These are a couple of regression fixes, cpuidle menu governor
optimizations, fixes for ACPI proccessor and battery drivers,
hibernation fix to avoid problems related to the e820 memory map,
fixes for a few cpufreq drivers and a new version of the suspend
profiling tool analyze_suspend.py.
Specifics:
- Fix for an ACPI-based device hotplug regression introduced in 3.14
that causes a kernel panic to trigger when memory hot-remove is
attempted with CONFIG_ACPI_HOTPLUG_MEMORY unset from Tang Chen
- Fix for a cpufreq regression introduced in 3.16 that triggers a
"sleeping function called from invalid context" bug in
dev_pm_opp_init_cpufreq_table() from Stephen Boyd
- ACPI battery driver fix for a warning message added in 3.16 that
prints silly stuff sometimes from Mariusz Ceier
- Hibernation fix for safer handling of mismatches in the 820 memory
map between the configurations during image creation and during the
subsequent restore from Chun-Yi Lee
- ACPI processor driver fix to handle CPU hotplug notifications
correctly during system suspend/resume from Lan Tianyu
- Series of four cpuidle menu governor cleanups that also should
speed it up a bit from Mel Gorman
- Fixes for the speedstep-smi, integrator, cpu0 and arm_big_little
cpufreq drivers from Hans Wennborg, Himangi Saraogi, Markus
Pargmann and Uwe Kleine-König
- Version 3.0 of the analyze_suspend.py suspend profiling tool from
Todd E Brandt"
* tag 'pm+acpi-3.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / battery: Fix warning message in acpi_battery_get_state()
PM / tools: analyze_suspend.py: update to v3.0
cpufreq: arm_big_little: fix module license spec
cpufreq: speedstep-smi: fix decimal printf specifiers
ACPI / hotplug: Check scan handlers in acpi_scan_hot_remove()
cpufreq: OPP: Avoid sleeping while atomic
cpufreq: cpu0: Do not print error message when deferring
cpufreq: integrator: Use set_cpus_allowed_ptr
PM / hibernate: avoid unsafe pages in e820 reserved regions
ACPI / processor: Make acpi_cpu_soft_notify() process CPU FROZEN events
cpuidle: menu: Lookup CPU runqueues less
cpuidle: menu: Call nr_iowait_cpu less times
cpuidle: menu: Use ktime_to_us instead of reinventing the wheel
cpuidle: menu: Use shifts when calculating averages where possible
Pull trivial tree changes from Jiri Kosina:
"Summer edition of trivial tree updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: fix two typos in watchdog-api.txt
irq-gic: remove file name from heading comment
MAINTAINERS: Add miscdevice.h to file list for char/misc drivers.
scsi: mvsas: mv_sas.c: Fix for possible null pointer dereference
doc: replace "practise" with "practice" in Documentation
befs: remove check for CONFIG_BEFS_RW
scsi: doc: fix 'SCSI_NCR_SETUP_MASTER_PARITY'
drivers/usb/phy/phy.c: remove a leading space
mfd: fix comment
cpuidle: fix comment
doc: hpfall.c: fix missing null-terminate after strncpy call
usb: doc: hotplug.txt code typos
kbuild: fix comment in Makefile.modinst
SH: add proper prompt to SH_MAGIC_PANEL_R2_VERSION
ARM: msm: Remove MSM_SCM
crypto: Remove MPILIB_EXTRA
doc: CN: remove dead link, kerneltrap.org no longer works
media: update reference, kerneltrap.org no longer works
hexagon: update reference, kerneltrap.org no longer works
doc: LSM: update reference, kerneltrap.org no longer works
...
The menu governer makes separate lookups of the CPU runqueue to get
load and number of IO waiters but it can be done with a single lookup.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
menu_select() via inline functions calls nr_iowait_cpu() twice as much
as necessary.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The ktime_to_us implementation is slightly better than the one implemented
in menu.c. Use it
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We use do_div even though the divisor will usually be a power-of-two
unless there are unusual outliers. Use shifts where possible
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
use CPUIDLE_DRIVER_STATE_START, instead of hardcoded value 0. As,
CPUIDLE_DRIVER_STATE_START can be 1/0 based on
CONFIG_ARCH_HAS_CPU_RELAX is defined or not.
Signed-off-by: Mohammad Merajul Islam Molla <meraj.enigma@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
STDDEV_THRESH was once defined and used in menu governor. But now its no longer
used anywhere. So removing the define.
Signed-off-by: Mohammad Merajul Islam Molla <meraj.enigma@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In menu_select function we check for correction factor every time.
If it is zero we are initializing to unity. Hence move it to init function
and initialise by unity, hence avoid repeated comparisons.
Signed-off-by: Chander Kashyap <chander.kashyap@linaro.org>
Reviewed-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If there is a PM QoS latency limit and all of the sufficiently shallow
C-states are disabled, the cpuidle menu governor returns 0 which on
some systems is CPUIDLE_DRIVER_STATE_START and shouldn't be returned
if that C-state has been disabled.
Fix the issue by modifying the menu governor to return (-1) in such
situations.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor performance multiplier defines a minimum predicted
idle duration to latency ratio. Instead of checking this separately
in every iteration of the state selection loop, adjust the overall
latency restriction for the whole loop if this restriction is tighter
than what is set by the QoS subsystem.
The original test
s->exit_latency * multiplier > data->predicted_us
becomes
s->exit_latency > data->predicted_us / multiplier
by dividing both sides of the comparison by "multiplier".
While division is likely to be several times slower than multiplication,
the minor performance hit allows making a generic sleep state selection
function based on (sleep duration, maximum latency) tuple.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor statistics update function tries to determine the
amount of time between entry to low power state and the occurrence
of the wakeup event. However, the time measured by the framework
includes exit latency on top of the desired value. This exit latency
is substracted from the measured value to obtain the desired value.
When measured value is not available, the menu governor assumes
the wakeup was caused by the timer and the time is equal to remaining
timer length. No exit latency should be substracted from this value.
This patch prevents the erroneous substraction and clarifies the
associated comment. It also removes one intermediate variable that
serves no purpose.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor uses coefficients as one method of actual idle
period length estimation. The coefficients are, as detailed below,
multipliers giving expected idle period length from time until next
timer expiry. The multipliers are supposed to have domain of (0..1].
The coefficients are fractions where only the numerators are stored
and denominators are a shared constant RESOLUTION*DECAY. Since the
value of the coefficient should always be greater than 0 and less
than or equal to 1, the numerator must have a value greater than
0 and less than or equal to RESOLUTION*DECAY.
If the coefficients are updated with measured idle durations exceeding
timer length, the multiplier may reach values exceeding unity (i.e.
the stored numerator exceeds RESOLUTION*DECAY). This patch ensures that
the multipliers are updated with durations capped to timer length.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently menu governor records the exit latency of the state it has
chosen for the idle period. The stored latency value is then later
used to calculate the actual length of the idle period. This value
may however be incorrect, as the entered state may not be the one
chosen by the governor. The entered state information is available,
so we can use that to obtain the real exit latency.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The field expected_us is used to store the time remaining until next
timer expiry. The name is inaccurate, as we really do not expect all
wakeups to be caused by timers. In addition, another field with a very
similar name (predicted_us) is used to store the predicted time
remaining until any wakeup source being active.
This patch renames expected_us to next_timer_us in order to better
reflect the contained information.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Field predicted_us value can never exceed expected_us value, but it has
a potentially larger type. As there is no need for additional 32 bits of
zeroes on 32 bit plaforms, change the type of predicted_us to match the
type of expected_us.
Field correction_factor is used to store a value that cannot exceed the
product of RESOLUTION and DECAY (default 1024*8 = 8192). The constants
cannot in practice be incremented to such values, that they'd overflow
unsigned int even on 32 bit systems, so the type is changed to avoid
unnecessary 64 bit arithmetic on 32 bit systems.
One multiplication of (now) 32 bit values needs an added cast to avoid
truncation of the result and has been added.
In order to avoid another multiplication from 32 bit domain to 64 bit
domain, the new correction_factor calculation has been changed from
new = old * (DECAY-1) / DECAY
to
new = old - old / DECAY,
which with infinite precision would yeild exactly the same result, but
now changes the direction of rounding. The impact is not significant as
the maximum accumulated difference cannot exceed the value of DECAY,
which is relatively small compared to product of RESOLUTION and DECAY
(8 / 8192).
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor has a number of tunable constants that may be changed
in the source. If certain combination of values are chosen, an overflow
is possible when the correction_factor is being recalculated.
This patch adds a warning regarding this possibility and describes the
change needed for fixing the issue. The change should not be permanently
enabled, as it will hurt performance when it is not needed.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor uses a static function get_typical_interval() to
try to detect a repeating pattern of wakeups. The previous interval
durations are stored as an array of unsigned ints, but the arithmetic
in the function is performed exclusively as 64 bit values, even when
the value stored in a variable is known not to exceed unsigned int,
which may be smaller and more efficient on some platforms.
This patch changes the types of varibles used to store some
intermediates, the maximum and and the cutoff threshold to unsigned
ints. Average and standard deviation are still treated as 64 bit values,
even when the values are known to be within the domain of unsigned int,
to avoid casts to ensure correct integer promotion for arithmetic
operations.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Struct menu_device member intervals is declared as u32, but the value
stored is (unsigned) int. The type is changed to match the value being
stored.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function get_typical_interval() initializes a number of variables
that are immediately after declarations assigned constant values.
In addition, there are multiple assignments on a single line, which
is explicitly forbidden by Documentation/CodingStyle.
This patch removes redundant initial values for the variables and
breaks up the multiple assignment line.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
get_typical_interval() uses int_sqrt() in calculation of standard
deviation. The formal parameter of int_sqrt() is unsigned long, which
may on some platforms be smaller than the 64 bit unsigned integer used
as the actual parameter. The overflow can occur frequently when actual
idle period lengths are in hundreds of milliseconds.
This patch adds a check for such overflow and rejects the candidate
average when an overflow would occur.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch rearranges a if-return-elsif-goto-fi-return sequence into
if-return-fi-if-return-fi-goto sequence. The functionality remains the
same. Also, a lengthy comment that did not describe the functionality
in the order it occurs is split into half and top half is moved closer
to actual implementation it describes.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch prevents cpuidle menu governor from using repeating interval
prediction result if the idle period predicted is longer than the one
allowed by shortest running timer.
Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit e11538d1 (cpuidle: Quickly notice prediction failure in
general case), since it depends on commit 69a37be (cpuidle: Quickly
notice prediction failure for repeat mode) that has been identified
as the source of a significant performance regression in v3.8 and
later.
Requested-by: Jeremy Eder <jeder@redhat.com>
Tested-by: Len Brown <len.brown@intel.com>
Cc: 3.8+ <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq governors are defined as modules in the code, but the Kconfig
options do not allow them to be built as modules. This is not really
a problem, but the cpuidle init ordering is: the cpuidle init
functions (framework and driver) and then the governors. That leads
to some weirdness in the cpuidle framework.
Namely, cpuidle_register_device() calls cpuidle_enable_device() which
fails at the first attempt, because governors have not been registered
yet. When a governor is registered, the framework calls
cpuidle_enable_device() again which runs __cpuidle_register_device()
only then. Of course, for that to work, the cpuidle_enable_device()
return value has to be ignored by cpuidle_register_device().
Instead of having this cyclic call graph and relying on a positive
side effects of the hackish back and forth cpuidle_enable_device()
calls it is better to fix the cpuidle init ordering.
To that end, replace the module init code with postcore_initcall()
so we have:
* cpuidle framework : core_initcall
* cpuidle governors : postcore_initcall
* cpuidle drivers : device_initcall
and remove the corresponding module exit code as it is dead anyway
(governors can't be built as modules).
[rjw: Changelog]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We realized that the power usage field is never filled and when it
is filled for tegra, the power_specified flag is not set causing all
of these values to be reset when the driver is initialized with
set_power_state().
However, the power_specified flag can be simply removed under the
assumption that the states are always backward sorted, which is the
case with the current code.
This change allows the menu governor select function and the
cpuidle_play_dead() to be simplified. Moreover, the
set_power_states() function can removed as it does not make sense
any more.
Drop the power_specified flag from struct cpuidle_driver and make
the related changes as described above.
As a consequence, this also fixes the bug where on the dynamic
C-states system, the power fields are not initialized.
[rjw: Changelog]
References: https://bugzilla.kernel.org/show_bug.cgi?id=42870
References: https://bugzilla.kernel.org/show_bug.cgi?id=43349
References: https://lkml.org/lkml/2012/10/16/518
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since cpuidle_state.power_usage is a signed value, use INT_MAX (instead
of -1) to init the local copies so that functions that tries to find
cpuidle states with minimum power usage works correctly even if they use
non-negative values.
Signed-off-by: Sivaram Nair <sivaramn@nvidia.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function detect_repeating_patterns was not very useful for
workloads with alternating long and short pauses, for example
virtual machines handling network requests for each other (say
a web and database server).
Instead, try to find a recent sleep interval that is somewhere
between the median and the mode sleep time, by discarding outliers
to the up side and recalculating the average and standard deviation
until that is no longer required.
This should do something sane with a sleep interval series like:
200 180 210 10000 30 1000 170 200
The current code would simply discard such a series, while the
new code will guess a typical sleep interval just shy of 200.
The original patch come from Rik van Riel <riel@redhat.com>.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.
The patch extends to general case that prediction logic get a small predicted
residency, so it choose a shallow C-state though the expected residency is large
. Once the prediction will be fail, the CPU will keep staying at shallow C-state
for a long time. Acutally, the CPU has change enter into deep C-state.
So when the expected residency is long enough but governor choose a shallow
C-state, an timer will be added in order to monitor if the prediction failure.
When C-state is waken up prior to the adding timer, the timer will be cancelled
initiatively. When the timer is triggered and menu governor will quickly notice
prediction failure and re-evaluates deeper C-states possibility.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.
cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.
There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
governor will predict it is repeat mode and there is another IPI wake up idle
CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.
In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.
Below is another case which will clearly show the patch much benefit:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>
volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;
void usage(void)
{
fprintf(stderr,
"Usage: idle_predict [options]\n"
" --help -h Print this help\n"
" --thread -n Thread number\n"
" --loop -l Loop times in shallow Cstate\n"
" --delay -t Sleep time (uS)in shallow Cstate\n");
}
void *simple_loop() {
int idle_num = 1;
while (!(*shutdown)) {
*count = *count + 1;
if (idle_num % loop)
usleep(delay);
else {
/* sleep 1 second */
usleep(1000000);
idle_num = 0;
}
idle_num++;
}
}
static void sighand(int sig)
{
*shutdown = 1;
}
int main(int argc, char *argv[])
{
sigset_t sigset;
int signum = SIGALRM;
int i, c, er = 0, thread_num = 8;
pthread_t pt[1024];
static char optstr[] = "n:l:t:h:";
while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'n':
thread_num = atoi(optarg);
break;
case 'l':
loop = atoi(optarg);
break;
case 't':
delay = atoi(optarg);
break;
case 'h':
default:
usage();
exit(1);
}
printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
count = malloc(sizeof(long));
shutdown = malloc(sizeof(int));
*count = 0;
*shutdown = 0;
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigprocmask (SIG_BLOCK, &sigset, NULL);
signal(SIGINT, sighand);
signal(SIGTERM, sighand);
for(i = 0; i < thread_num ; i++)
pthread_create(&pt[i], NULL, simple_loop, NULL);
for (i = 0; i < thread_num; i++)
pthread_join(pt[i], NULL);
exit(0);
}
Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop
We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.
While after patched the kernel, we find that deep C-state will keep >99.6%.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
For the mechanism introduced by commit cbc9ef0 (PM / Domains: Add
preliminary support for cpuidle, v2) to work with the ladder
governor, that governor should respect the "disabled" state flag
added by that commit. Change the ladder governor accordingly.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
There are two cpuidle governors ladder and menu. While the ladder
governor is always available, if CONFIG_CPU_IDLE is selected, the
menu governor additionally requires CONFIG_NO_HZ.
A particular C state can be disabled by writing to the sysfs file
/sys/devices/system/cpu/cpuN/cpuidle/stateN/disable, but this mechanism
is only implemented in the menu governor. Thus, in a system where
CONFIG_NO_HZ is not selected, the ladder governor becomes default and
always will walk through all sleep states - irrespective of whether the
C state was disabled via sysfs or not. The only way to select a specific
C state was to write the related latency to /dev/cpu_dma_latency and
keep the file open as long as this setting was required - not very
practical and not suitable for setting a single core in an SMP system.
With this patch, the ladder governor only will promote to the next
C state, if it has not been disabled, and it will demote, if the
current C state was disabled.
Note that the patch does not make the setting of the sysfs variable
"disable" coherent, i.e. if one is disabling a light state, then all
deeper states are disabled as well, but the "disable" variable does not
reflect it. Likewise, if one enables a deep state but a lighter state
still is disabled, then this has no effect. A related section has been
added to the documentation.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
On some systems there are CPU cores located in the same power
domains as I/O devices. Then, power can only be removed from the
domain if all I/O devices in it are not in use and the CPU core
is idle. Add preliminary support for that to the generic PM domains
framework.
First, the platform is expected to provide a cpuidle driver with one
extra state designated for use with the generic PM domains code.
This state should be initially disabled and its exit_latency value
should be set to whatever time is needed to bring up the CPU core
itself after restoring power to it, not including the domain's
power on latency. Its .enter() callback should point to a procedure
that will remove power from the domain containing the CPU core at
the end of the CPU power transition.
The remaining characteristics of the extra cpuidle state, referred to
as the "domain" cpuidle state below, (e.g. power usage, target
residency) should be populated in accordance with the properties of
the hardware.
Next, the platform should execute genpd_attach_cpuidle() on the PM
domain containing the CPU core. That will cause the generic PM
domains framework to treat that domain in a special way such that:
* When all devices in the domain have been suspended and it is about
to be turned off, the states of the devices will be saved, but
power will not be removed from the domain. Instead, the "domain"
cpuidle state will be enabled so that power can be removed from
the domain when the CPU core is idle and the state has been chosen
as the target by the cpuidle governor.
* When the first I/O device in the domain is resumed and
__pm_genpd_poweron(() is called for the first time after
power has been removed from the domain, the "domain" cpuidle
state will be disabled to avoid subsequent surprise power removals
via cpuidle.
The effective exit_latency value of the "domain" cpuidle state
depends on the time needed to bring up the CPU core itself after
restoring power to it as well as on the power on latency of the
domain containing the CPU core. Thus the "domain" cpuidle state's
exit_latency has to be recomputed every time the domain's power on
latency is updated, which may happen every time power is restored
to the domain, if the measured power on latency is greater than
the latency stored in the corresponding generic_pm_domain structure.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Andrew J.Schorr raises a question. When he changes the disable setting on
a single CPU, it affects all the other CPUs. Basically, currently, the
disable field is per-driver instead of per-cpu. All the C states of the
same driver are shared by all CPU in the same machine.
The patch changes the `disable' field to per-cpu, so we could set this
separately for each cpu.
Signed-off-by: ShuoX Liu <shuox.liu@intel.com>
Reported-by: Andrew J.Schorr <aschorr@telemetry-investments.com>
Reviewed-by: Yanmin Zhang <yanmin_zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
power_usage is always assigned a negative value and should be declared
a signed integer
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Some C states of new CPU might be not good. One reason is BIOS might
configure them incorrectly. To help developers root cause it quickly, the
patch adds a new sysfs entry, so developers could disable specific C state
manually.
In addition, C state might have much impact on performance tuning, as it
takes much time to enter/exit C states, which might delay interrupt
processing. With the new debug option, developers could check if a deep C
state could impact performance and how much impact it could cause.
Also add this option in Documentation/cpuidle/sysfs.txt.
[akpm@linux-foundation.org: check kstrtol return value]
Signed-off-by: ShuoX Liu <shuox.liu@intel.com>
Reviewed-by: Yanmin Zhang <yanmin_zhang@intel.com>
Reviewed-and-Tested-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
cpuidle: Single/Global registration of idle states
cpuidle: Split cpuidle_state structure and move per-cpu statistics fields
cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare()
cpuidle: Move dev->last_residency update to driver enter routine; remove dev->last_state
ACPI: Fix CONFIG_ACPI_DOCK=n compiler warning
ACPI: Export FADT pm_profile integer value to userspace
thermal: Prevent polling from happening during system suspend
ACPI: Drop ACPI_NO_HARDWARE_INIT
ACPI atomicio: Convert width in bits to bytes in __acpi_ioremap_fast()
PNPACPI: Simplify disabled resource registration
ACPI: Fix possible recursive locking in hwregs.c
ACPI: use kstrdup()
mrst pmu: update comment
tools/power turbostat: less verbose debugging
This patch makes the cpuidle_states structure global (single copy)
instead of per-cpu. The statistics needed on per-cpu basis
by the governor are kept per-cpu. This simplifies the cpuidle
subsystem as state registration is done by single cpu only.
Having single copy of cpuidle_states saves memory. Rare case
of asymmetric C-states can be handled within the cpuidle driver
and architectures such as POWER do not have asymmetric C-states.
Having single/global registration of all the idle states,
dynamic C-state transitions on x86 are handled by
the boot cpu. Here, the boot cpu would disable all the devices,
re-populate the states and later enable all the devices,
irrespective of the cpu that would receive the notification first.
Reference:
https://lkml.org/lkml/2011/4/25/83
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com>
Tested-by: Jean Pihet <j-pihet@ti.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The cpuidle_device->prepare() mechanism causes updates to the
cpuidle_state[].flags, setting and clearing CPUIDLE_FLAG_IGNORE
to tell the governor not to chose a state on a per-cpu basis at
run-time. State demotion is now handled by the driver and it returns
the actual state entered. Hence, this mechanism is not required.
Also this removes per-cpu flags from cpuidle_state enabling
it to be made global.
Reference:
https://lkml.org/lkml/2011/3/25/52
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm>
Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com>
Tested-by: Jean Pihet <j-pihet@ti.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cpuidle governor only suggests the state to enter using the
governor->select() interface, but allows the low level driver to
override the recommended state. The actual entered state
may be different because of software or hardware demotion. Software
demotion is done by the back-end cpuidle driver and can be accounted
correctly. Current cpuidle code uses last_state field to capture the
actual state entered and based on that updates the statistics for the
state entered.
Ideally the driver enter routine should update the counters,
and it should return the state actually entered rather than the time
spent there. The generic cpuidle code should simply handle where
the counters live in the sysfs namespace, not updating the counters.
Reference:
https://lkml.org/lkml/2011/3/25/52
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Trinabh Gupta <g.trinabh@gmail.com>
Tested-by: Jean Pihet <j-pihet@ti.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Len Brown <len.brown@intel.com>
This file has module_init/exit and MODULE_LICENSE, and so it
needs the full module.h header.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>