Commit 09ca4c9b59 (cpufreq: powernv: Replacing pstate_id with
frequency table index) changes calc_global_pstate() to use
cpufreq_table index instead of pstate_id.
But in gpstate_timer_handler(), pstate_id was being passed instead
of cpufreq_table index, which caused index_to_pstate() to access
out of bound indices, leading to this crash.
Adding sanity check for index and pstate, to ensure only valid pstate
and index values are returned.
Call Trace:
[c00000078d66b130] [c00000000011d224] __free_irq+0x234/0x360
(unreliable)
[c00000078d66b1c0] [c00000000011d44c] free_irq+0x6c/0xa0
[c00000078d66b1f0] [c00000000006c4f8] opal_event_shutdown+0x88/0xd0
[c00000078d66b230] [c000000000067a4c] opal_shutdown+0x1c/0x90
[c00000078d66b260] [c000000000063a00] pnv_shutdown+0x20/0x40
[c00000078d66b280] [c000000000021538] machine_restart+0x38/0x90
[c0000000078d66b310] [c000000000965ea0] panic+0x284/0x300
[c00000078d66b3a0] [c00000000001f508] die+0x388/0x450
[c00000078d66b430] [c000000000045a50] bad_page_fault+0xd0/0x140
[c00000078d66b4a0] [c000000000008964] handle_page_fault+0x2c/0x30
interrupt: 300 at gpstate_timer_handler+0x150/0x260
LR = gpstate_timer_handler+0x130/0x260
[c00000078d66b7f0] [c000000000132b58] call_timer_fn+0x58/0x1c0
[c00000078d66b880] [c000000000132e20] expire_timers+0x130/0x1d0
[c00000078d66b8f0] [c000000000133068] run_timer_softirq+0x1a8/0x230
[c00000078d66b980] [c0000000000b535c] __do_softirq+0x18c/0x400
[c00000078d66ba70] [c0000000000b5828] irq_exit+0xc8/0x100
[c00000078d66ba90] [c00000000001e214] timer_interrupt+0xa4/0xe0
[c00000078d66bac0] [c0000000000027d0] decrementer_common+0x150/0x180
interrupt: 901 at arch_local_irq_restore+0x74/0x90
0] [c000000000106b34] call_cpuidle+0x44/0x90
[c00000078d66be50] [c00000000010708c] cpu_startup_entry+0x38c/0x460
[c00000078d66bf20] [c00000000003d930] start_secondary+0x330/0x380
[c00000078d66bf90] [c000000000008e6c] start_secondary_prolog+0x10/0x14
Fixes: 09ca4c9b59 (cpufreq: powernv: Replacing pstate_id with frequency table index)
Reported-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Prevent the low-level assembly hibernate code on x86-64 from
referring to __PAGE_OFFSET directly as a symbol which doesn't work
when the kernel identity mapping base is randomized, in which case
__PAGE_OFFSET is a variable (Rafael Wysocki).
- Avoid selecting CPU_FREQ_STAT by default as the statistics are not
required for proper cpufreq operation (Borislav Petkov).
- Add Skylake-X and Broadwell-X IDs to the intel_pstate's list of
processors where out-of-band (OBB) control of P-states is possible
and if that is in use, intel_pstate should not attempt to manage
P-states (Srinivas Pandruvada).
- Drop some unnecessary checks from the wakeup IRQ handling code in
the PM core (Markus Elfring).
- Reduce the number operating performance point (OPP) lookups in
one of the OPP framework's helper functions (Jisheng Zhang).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXpKLNAAoJEILEb/54YlRx6L4P/39GR0kVB7vIyCajdWc4f3gh
0zh5RbKC0YT4F6upebzgQ9uYS9cv4y+df7ShVwKQAea3wDReEmZhM/egOGw0Ls+8
SS1MiJq1LSekyMIWH6cXwZsH69/V0LuWTBbzWYBgHUDbfEMlgwV5ZZMEH14/2bWw
d4SLUUiW5P42im+IDAxpdYneKOrbJo3txj6WbOutgtIrHdPko6lF1dDouKvI1QTk
zCBOkEB9nELq3rWN/sbPmHzbmbj/yFiiHk5+iqwuKKJZI8PQB9/C6Qmc3wvjtPpc
GXLPI+OHqLgBMofGsiKOvm3hPQAIjf/ERsUilHLE3qOi/Mi0qj7U4dFivZrPWaCG
j2bD+b36TffmG1r8L7NYbEKU60syeIFSRqAngbyswu6XF+NdVboaifENGcgM3tBC
pUC1mFh/4PMKP5zW9mOwE6WSntZkw14CVR+A3fRFOuTBavNvjGdckwLi/aBsZdU3
K4DJUFzdELF7+JdqnQV35yV2tgMbJhxQa/QBykiFBh3AyqliOZ8uBoIxxLinrGmH
XWR3kB4ZPRBIStGI9IpG3lNhLLU7mLIdaFhGayicicAwFcLsXN7oKWVzYfUqWA9T
1ptXApcRf6S9J1JnKHkznoQ1/D1pNYLbH+t7BOtWdK8SWBhMS2Lzo2kyKWsCFkJS
16J8j1tcLDhN1dvcBT55
=WvPB
-----END PGP SIGNATURE-----
Merge tag 'pm-extra-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"A few more fixes and cleanups in the x86-64 low-level hibernation
code, PM core, cpufreq (Kconfig and intel_pstate), and the operating
points framework.
Specifics:
- Prevent the low-level assembly hibernate code on x86-64 from
referring to __PAGE_OFFSET directly as a symbol which doesn't work
when the kernel identity mapping base is randomized, in which case
__PAGE_OFFSET is a variable (Rafael Wysocki).
- Avoid selecting CPU_FREQ_STAT by default as the statistics are not
required for proper cpufreq operation (Borislav Petkov).
- Add Skylake-X and Broadwell-X IDs to the intel_pstate's list of
processors where out-of-band (OBB) control of P-states is possible
and if that is in use, intel_pstate should not attempt to manage
P-states (Srinivas Pandruvada).
- Drop some unnecessary checks from the wakeup IRQ handling code in
the PM core (Markus Elfring).
- Reduce the number operating performance point (OPP) lookups in one
of the OPP framework's helper functions (Jisheng Zhang)"
* tag 'pm-extra-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
x86/power/64: Do not refer to __PAGE_OFFSET from assembly code
cpufreq: Do not default-yes CPU_FREQ_STAT
cpufreq: intel_pstate: Add more out-of-band IDs
PM / OPP: optimize dev_pm_opp_set_rate() performance a bit
PM-wakeup: Delete unnecessary checks before three function calls
* pm-sleep:
x86/power/64: Do not refer to __PAGE_OFFSET from assembly code
* pm-cpufreq:
cpufreq: Do not default-yes CPU_FREQ_STAT
cpufreq: intel_pstate: Add more out-of-band IDs
* pm-core:
PM-wakeup: Delete unnecessary checks before three function calls
* pm-opp:
PM / OPP: optimize dev_pm_opp_set_rate() performance a bit
CPU frequency transition statistics are not absolutely required for
proper cpufreq operation on the system AFAICT so remove the default-yes
setting in Kconfig.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Driver updates for ARM SoCs.
A slew of changes this release cycle. The reset driver tree, that we merge
through arm-soc for historical reasons, is also sizable this time around.
Among the changes:
- clps711x: Treewide changes to compatible strings, merged here for simplicity.
- Qualcomm: SCM firmware driver cleanups, move to platform driver
- ux500: Major cleanups, removal of old mach-specific infrastructure.
- Atmel external bus memory driver
- Move of brcmstb platform to the rest of bcm
- PMC driver updates for tegra, various fixes and improvements
- Samsung platform driver updates to support 64-bit Exynos platforms
- Reset controller cleanups moving to devm_reset_controller_register() APIs
- Reset controller driver for Amlogic Meson
- Reset controller driver for Hisilicon hi6220
- ARM SCPI power domain support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXnm1XAAoJEIwa5zzehBx35lcP/ApuQarIXeZCQZtjlUBV9McW
o3o7FhKFHePmEPeoYCvVeK5D8NykTkQv3WpnCknoxPJzxGJF7jbPWQJcVnXfKOXD
kTcyIK15WL2HHtSE3lYyLfyUPz8AbJyRt0l0cxgcg6jvo+uzlWooNz1y78rLIYzg
UwRssj7OiHv4dsyYRHZIsjnB8gMWw8rYMk154gP2xy6MnNXXzzOVVnOiVqxSZBm+
EgIIcROMOqkkHuFlClMYKluIgrmgz1Ypjf+FuAg7dqXZd+TGRrmGXeI7SkGThfLu
nyvY3N18NViNu7xOUkI9zg7+ifyYM8Si9ylalSICSJdIAxZfiwFqFaLJvVWKU1rY
rBOBjKckQI0/X9WYusFNFHcijhIFV8/FgGAnVRRMPdvlCss7Zp03C9mR4AEhmKMX
rLG49x81hU1C+LftC59ml3iB8dhZrrRkbxNHjLFHVGWNrKMrmJKa8JhXGRAoNM+u
LRauiuJZatqvLfISNvpfcoW2EashVoU3f+uC8ymT3QCyME3wZm0t7T4tllxhMfBl
sOgJqNkTKDmPLofwm/dASiLML7ZF1WePScrFyOACnj9K4mUD+OaCnowtWoQPu0eI
aNmT84oosJ2S9F/iUDPtFHXdzQ+1QPPfSiQ9FXMoauciVq/2F+pqq68yYgqoxFOG
vmkmG2YM4Wyq43u0BONR
=O8+y
-----END PGP SIGNATURE-----
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC driver updates from Olof Johansson:
"Driver updates for ARM SoCs.
A slew of changes this release cycle. The reset driver tree, that we
merge through arm-soc for historical reasons, is also sizable this
time around.
Among the changes:
- clps711x: Treewide changes to compatible strings, merged here for simplicity.
- Qualcomm: SCM firmware driver cleanups, move to platform driver
- ux500: Major cleanups, removal of old mach-specific infrastructure.
- Atmel external bus memory driver
- Move of brcmstb platform to the rest of bcm
- PMC driver updates for tegra, various fixes and improvements
- Samsung platform driver updates to support 64-bit Exynos platforms
- Reset controller cleanups moving to devm_reset_controller_register() APIs
- Reset controller driver for Amlogic Meson
- Reset controller driver for Hisilicon hi6220
- ARM SCPI power domain support"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (100 commits)
ARM: ux500: consolidate base platform files
ARM: ux500: move soc_id driver to drivers/soc
ARM: ux500: call ux500_setup_id later
ARM: ux500: consolidate soc_device code in id.c
ARM: ux500: remove cpu_is_u* helpers
ARM: ux500: use CLK_OF_DECLARE()
ARM: ux500: move l2x0 init to .init_irq
mfd: db8500 stop passing around platform data
ASoC: ab8500-codec: remove platform data based probe
ARM: ux500: move ab8500_regulator_plat_data into driver
ARM: ux500: remove unused regulator data
soc: raspberrypi-power: add CONFIG_OF dependency
firmware: scpi: add CONFIG_OF dependency
video: clps711x-fb: Changing the compatibility string to match with the smallest supported chip
input: clps711x-keypad: Changing the compatibility string to match with the smallest supported chip
pwm: clps711x: Changing the compatibility string to match with the smallest supported chip
serial: clps711x: Changing the compatibility string to match with the smallest supported chip
irqchip: clps711x: Changing the compatibility string to match with the smallest supported chip
clocksource: clps711x: Changing the compatibility string to match with the smallest supported chip
clk: clps711x: Changing the compatibility string to match with the smallest supported chip
...
Add Skylake-X and Broadwell-X IDs for out-of-band (OBB) control of
P-States.
For these processors, if MSR_MISC_PWR_MGMT BIT(8) == 1, then the
Intel P-State driver should exit as OS can't control P-States.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Subject/changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Rework the cpufreq governor interface to make it more straightforward
and modify the conservative governor to avoid using transition
notifications (Rafael Wysocki).
- Rework the handling of frequency tables by the cpufreq core to make
it more efficient (Viresh Kumar).
- Modify the schedutil governor to reduce the number of wakeups it
causes to occur in cases when the CPU frequency doesn't need to be
changed (Steve Muckle, Viresh Kumar).
- Fix some minor issues and clean up code in the cpufreq core and
governors (Rafael Wysocki, Viresh Kumar).
- Add Intel Broxton support to the intel_pstate driver (Srinivas
Pandruvada).
- Fix problems related to the config TDP feature and to the validity
of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
Srinivas Pandruvada).
- Make intel_pstate update the cpu_frequency tracepoint even if
the frequency doesn't change to avoid confusing powertop (Rafael
Wysocki).
- Clean up the usage of __init/__initdata in intel_pstate, mark some
of its internal variables as __read_mostly and drop an unused
structure element from it (Jisheng Zhang, Carsten Emde).
- Clean up the usage of some duplicate MSR symbols in intel_pstate
and turbostat (Srinivas Pandruvada).
- Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
Adiga, Viresh Kumar, Ben Dooks).
- Fix a regression (introduced during the 4.5 cycle) in the
pcc-cpufreq driver by reverting the problematic commit (Andreas
Herrmann).
- Add support for Intel Denverton to intel_idle, clean up Broxton
support in it and make it explicitly non-modular (Jacob Pan,
Jan Beulich, Paul Gortmaker).
- Add support for Denverton and Ivy Bridge server to the Intel RAPL
power capping driver and make it more careful about the handing
of MSRs that may not be present (Jacob Pan, Xiaolong Wang).
- Fix resume from hibernation on x86-64 by making the CPU offline
during resume avoid using MONITOR/MWAIT in the "play dead" loop
which may lead to an inadvertent "revival" of a "dead" CPU and
a page fault leading to a kernel crash from it (Rafael Wysocki).
- Make memory management during resume from hibernation more
straightforward (Rafael Wysocki).
- Add debug features that should help to detect problems related
to hibernation and resume from it (Rafael Wysocki, Chen Yu).
- Clean up hibernation core somewhat (Rafael Wysocki).
- Prevent KASAN from instrumenting the hibernation core which leads
to large numbers of false-positives from it (James Morse).
- Prevent PM (hibernate and suspend) notifiers from being called
during the cleanup phase if they have not been called during the
corresponding preparation phase which is possible if one of the
other notifiers returns an error at that time (Lianwei Wang).
- Improve suspend-related debug printout in the tasks freezer and
clean up suspend-related console handling (Roger Lu, Borislav
Petkov).
- Update the AnalyzeSuspend script in the kernel sources to
version 4.2 (Todd Brandt).
- Modify the generic power domains framework to make it handle
system suspend/resume better (Ulf Hansson).
- Make the runtime PM framework avoid resuming devices synchronously
when user space changes the runtime PM settings for them and
improve its error reporting (Rafael Wysocki, Linus Walleij).
- Fix error paths in devfreq drivers (exynos, exynos-ppmu, exynos-bus)
and in the core, make some devfreq code explicitly non-modular and
change some of it into tristate (Bartlomiej Zolnierkiewicz,
Peter Chen, Paul Gortmaker).
- Add DT support to the generic PM clocks management code and make
it export some more symbols (Jon Hunter, Paul Gortmaker).
- Make the PCI PM core code slightly more robust against possible
driver errors (Andy Shevchenko).
- Make it possible to change DESTDIR and PREFIX in turbostat
(Andy Shevchenko).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXl7/dAAoJEILEb/54YlRx+VgQAIQJOWvxKew3Yl02c/sdj9OT
5VNnFrzGzdcAPofvvG9qGq8B0Es1vYehJpwwOB21ri8EvYv0riIiU1yrqslObojQ
oaZOkSBpbIoKjGR4CpYA/A+feE+8EqIBdPGd+lx5a6oRdUi7tRVHBG9lyLO3FB/i
jan1q8dMpZsmu+Y+rVVHGnCVuIlIEqr2ZnZfCwDAulO2Arp/QFAh4kH08ELATvrl
bkPa25vq7/VMP/vCDzrfZKD5mUuKogIRu/J5wx4py1nE+FB35cKKyqBOgklLwAeY
UI8vjDhr/myNUs54AZlktOkq47TCYvjvhX9kmOxBjuWqFbRusU012IRek1fYPRIV
ZqbkqNX7UEVQwunAEg9AyFwyzEtOht93dQDT5RLEd4QzKuM76gmHpLeTGGMzE+nu
FnmF9JGl4DVwqpZl9yU2+hR2Mt3bP8OF8qYmNiGUB3KO4emPslhSd+6y8liA5Bx2
SJf0Gb//vaHCh3/uMnwAonYPqRkZvBLOMwuL1VUjNQfRMnQtDdgHMYB1aT/EglPA
8ww6j4J8rVRLAxvYQ3UEmNA/vBNclKXblRR18+JddEZP9/oX0ATfwnCCUpr839uk
xxyQhrm4/AI60+PHWCX4GG80YrKdOGTkF7LXCQZanVWjjuyF17rufegZ2YWLT07v
JU1Cmumfdy2jJluT8xsR
=uVGz
-----END PGP SIGNATURE-----
Merge tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"Again, the majority of changes go into the cpufreq subsystem, but
there are no big features this time. The cpufreq changes that stand
out somewhat are the governor interface rework and improvements
related to the handling of frequency tables. Apart from those, there
are fixes and new device/CPU IDs in drivers, cleanups and an
improvement of the new schedutil governor.
Next, there are some changes in the hibernation core, including a fix
for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
during resume from hibernation, a few core improvements related to
memory management during resume, a couple of additional debug features
and cleanups.
Finally, we have some fixes and cleanups in the devfreq subsystem,
generic power domains framework improvements related to system
suspend/resume, support for some new chips in intel_idle and in the
power capping RAPL driver, a new version of the AnalyzeSuspend utility
and some assorted fixes and cleanups.
Specifics:
- Rework the cpufreq governor interface to make it more
straightforward and modify the conservative governor to avoid using
transition notifications (Rafael Wysocki).
- Rework the handling of frequency tables by the cpufreq core to make
it more efficient (Viresh Kumar).
- Modify the schedutil governor to reduce the number of wakeups it
causes to occur in cases when the CPU frequency doesn't need to be
changed (Steve Muckle, Viresh Kumar).
- Fix some minor issues and clean up code in the cpufreq core and
governors (Rafael Wysocki, Viresh Kumar).
- Add Intel Broxton support to the intel_pstate driver (Srinivas
Pandruvada).
- Fix problems related to the config TDP feature and to the validity
of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
Srinivas Pandruvada).
- Make intel_pstate update the cpu_frequency tracepoint even if the
frequency doesn't change to avoid confusing powertop (Rafael
Wysocki).
- Clean up the usage of __init/__initdata in intel_pstate, mark some
of its internal variables as __read_mostly and drop an unused
structure element from it (Jisheng Zhang, Carsten Emde).
- Clean up the usage of some duplicate MSR symbols in intel_pstate
and turbostat (Srinivas Pandruvada).
- Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
Adiga, Viresh Kumar, Ben Dooks).
- Fix a regression (introduced during the 4.5 cycle) in the
pcc-cpufreq driver by reverting the problematic commit (Andreas
Herrmann).
- Add support for Intel Denverton to intel_idle, clean up Broxton
support in it and make it explicitly non-modular (Jacob Pan, Jan
Beulich, Paul Gortmaker).
- Add support for Denverton and Ivy Bridge server to the Intel RAPL
power capping driver and make it more careful about the handing of
MSRs that may not be present (Jacob Pan, Xiaolong Wang).
- Fix resume from hibernation on x86-64 by making the CPU offline
during resume avoid using MONITOR/MWAIT in the "play dead" loop
which may lead to an inadvertent "revival" of a "dead" CPU and a
page fault leading to a kernel crash from it (Rafael Wysocki).
- Make memory management during resume from hibernation more
straightforward (Rafael Wysocki).
- Add debug features that should help to detect problems related to
hibernation and resume from it (Rafael Wysocki, Chen Yu).
- Clean up hibernation core somewhat (Rafael Wysocki).
- Prevent KASAN from instrumenting the hibernation core which leads
to large numbers of false-positives from it (James Morse).
- Prevent PM (hibernate and suspend) notifiers from being called
during the cleanup phase if they have not been called during the
corresponding preparation phase which is possible if one of the
other notifiers returns an error at that time (Lianwei Wang).
- Improve suspend-related debug printout in the tasks freezer and
clean up suspend-related console handling (Roger Lu, Borislav
Petkov).
- Update the AnalyzeSuspend script in the kernel sources to version
4.2 (Todd Brandt).
- Modify the generic power domains framework to make it handle system
suspend/resume better (Ulf Hansson).
- Make the runtime PM framework avoid resuming devices synchronously
when user space changes the runtime PM settings for them and
improve its error reporting (Rafael Wysocki, Linus Walleij).
- Fix error paths in devfreq drivers (exynos, exynos-ppmu,
exynos-bus) and in the core, make some devfreq code explicitly
non-modular and change some of it into tristate (Bartlomiej
Zolnierkiewicz, Peter Chen, Paul Gortmaker).
- Add DT support to the generic PM clocks management code and make it
export some more symbols (Jon Hunter, Paul Gortmaker).
- Make the PCI PM core code slightly more robust against possible
driver errors (Andy Shevchenko).
- Make it possible to change DESTDIR and PREFIX in turbostat (Andy
Shevchenko)"
* tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
PM / hibernate: Introduce test_resume mode for hibernation
cpufreq: export cpufreq_driver_resolve_freq()
cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
PCI / PM: check all fields in pci_set_platform_pm()
cpufreq: acpi-cpufreq: use cached frequency mapping when possible
cpufreq: schedutil: map raw required frequency to driver frequency
cpufreq: add cpufreq_driver_resolve_freq()
cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
intel_pstate: Update cpu_frequency tracepoint every time
cpufreq: intel_pstate: clean remnant struct element
PM / tools: scripts: AnalyzeSuspend v4.2
x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
cpufreq: powernv: Replacing pstate_id with frequency table index
intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
PM / hibernate: Image data protection during restoration
PM / hibernate: Add missing braces in __register_nosave_region()
PM / hibernate: Clean up comments in snapshot.c
PM / hibernate: Clean up function headers in snapshot.c
PM / hibernate: Add missing braces in hibernate_setup()
...
Pull timer updates from Thomas Gleixner:
"This update provides the following changes:
- The rework of the timer wheel which addresses the shortcomings of
the current wheel (cascading, slow search for next expiring timer,
etc). That's the first major change of the wheel in almost 20
years since Finn implemted it.
- A large overhaul of the clocksource drivers init functions to
consolidate the Device Tree initialization
- Some more Y2038 updates
- A capability fix for timerfd
- Yet another clock chip driver
- The usual pile of updates, comment improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
tick/nohz: Optimize nohz idle enter
clockevents: Make clockevents_subsys static
clocksource/drivers/time-armada-370-xp: Fix return value check
timers: Implement optimization for same expiry time in mod_timer()
timers: Split out index calculation
timers: Only wake softirq if necessary
timers: Forward the wheel clock whenever possible
timers/nohz: Remove pointless tick_nohz_kick_tick() function
timers: Optimize collect_expired_timers() for NOHZ
timers: Move __run_timers() function
timers: Remove set_timer_slack() leftovers
timers: Switch to a non-cascading wheel
timers: Reduce the CPU index space to 256k
timers: Give a few structs and members proper names
hlist: Add hlist_is_singular_node() helper
signals: Use hrtimer for sigtimedwait()
timers: Remove the deprecated mod_timer_pinned() API
timers, net/ipv4/inet: Initialize connection request timers as pinned
timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
timers, drivers/tty/metag_da: Initialize the poll timer as pinned
...
Pull x86 platform updates from Ingo Molnar:
"The main changes in this cycle were:
- Intel-SoC enhancements (Andy Shevchenko)
- Intel CPU symbolic model definition rework (Dave Hansen)
- ... other misc changes"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
x86/sfi: Enable enumeration of SD devices
x86/pci: Use MRFLD abbreviation for Merrifield
x86/platform/intel-mid: Make vertical indentation consistent
x86/platform/intel-mid: Mark regulators explicitly defined
x86/platform/intel-mid: Rename mrfl.c to mrfld.c
x86/platform/intel-mid: Enable spidev on Intel Edison boards
x86/platform/intel-mid: Extend PWRMU to support Penwell
x86/pci, x86/platform/intel_mid_pci: Remove duplicate power off code
x86/platform/intel-mid: Add pinctrl for Intel Merrifield
x86/platform/intel-mid: Enable GPIO expanders on Edison
x86/platform/intel-mid: Add Power Management Unit driver
x86/platform/atom/punit: Enable support for Merrifield
x86/platform/intel_mid_pci: Rework IRQ0 workaround
x86, thermal: Clean up and fix CPU model detection for intel_soc_dts_thermal
x86, mmc: Use Intel family name macros for mmc driver
x86/intel_telemetry: Use Intel family name macros for telemetry driver
x86/acpi/lss: Use Intel family name macros for the acpi_lpss driver
x86/cpufreq: Use Intel family name macros for the intel_pstate cpufreq driver
x86/platform: Use new Intel model number macros
x86/intel_idle: Use Intel family macros for intel_idle
...
This reverts commit 790d849bf8.
Using a v4.7-rc7 kernel on a HP ProLiant triggered following messages
pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
cpufreq: ondemand governor failed, too long transition latency of HW, fallback to performance governor
The last line was shown for each CPU in the system.
Testing v4.5 (where commit 790d849b was integrated) triggered
similar messages. Same behaviour on a 2nd HP Proliant system.
So commit 790d849bf (cpufreq: pcc-cpufreq: update default value of
cpuinfo_transition_latency) causes the system to use performance
governor which, I guess, was not the intention of the patch.
Enabling debug output in pcc-cpufreq provides following verbose output:
pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
pcc_get_offset: for CPU 0: pcc_cpu_data input_offset: 0x44, pcc_cpu_data output_offset: 0x48
init: policy->max is 2800000, policy->min is 1200000
get: get_freq for CPU 0
get: SUCCESS: (virtual) output_offset for cpu 0 is 0xffffc9000d7c0048, contains a value of: 0xff06. Speed is: 168000 MHz
cpufreq: ondemand governor failed, too long transition latency of HW, fallback to performance governor
target: CPU 0 should go to target freq: 2800000 (virtual) input_offset is 0xffffc9000d7c0044
target: was SUCCESSFUL for cpu 0
I am asking to revert 790d849bf to re-enable usage of ondemand
governor with pcc-cpufreq.
Fixes: 790d849bf (cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency)
CC: <stable@vger.kernel.org> # 4.5+
Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Export cpufreq_driver_resolve_freq() since governors may be compiled as
modules.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The handlers provided by cpufreq core are sufficient for resolving the
frequency for drivers providing ->target_index(), as the core already
has the frequency table and so ->resolve_freq() isn't required for such
platforms.
This patch disallows drivers with ->target_index() callback to use the
->resolve_freq() callback.
Also, it fixes a potential kernel crash for drivers providing ->target()
but no ->resolve_freq().
Fixes: e3c0623608 "cpufreq: add cpufreq_driver_resolve_freq()"
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A call to cpufreq_driver_resolve_freq will cache the mapping from
the desired target frequency to the frequency table index. If there
is a mapping for the desired target frequency then use it instead of
looking up the mapping again.
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cpufreq governors may need to know what a particular target frequency
maps to in the driver without necessarily wanting to set the frequency.
Support this operation via a new cpufreq API,
cpufreq_driver_resolve_freq(). This API returns the lowest driver
frequency equal or greater than the target frequency
(CPUFREQ_RELATION_L), subject to any policy (min/max) or driver
limitations. The mapping is also cached in the policy so that a
subsequent fast_switch operation can avoid repeating the same lookup.
The API will call a new cpufreq driver callback, resolve_freq(), if it
has been registered by the driver. Otherwise the frequency is resolved
via cpufreq_frequency_table_target(). Rather than require ->target()
style drivers to provide a resolve_freq() callback it is left to the
caller to ensure that the driver implements this callback if necessary
to use cpufreq_driver_resolve_freq().
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The MSR MSR_HWP_INTERRUPT is valid only when CPUID.06H:EAX[8] = 1, so
check for feature before accessing this MSR.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, intel_pstate only updates the cpu_frequency tracepoint
if the new P-state to set is different from the current one, but
that causes powertop to report 100% idle on an 100% loaded system
sometimes.
Prevent that from happening by updating the cpu_frequency tracepoint
every time intel_pstate_update_pstate() is called.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>-
When I was working with the Intel P state driver I came across a
remnant struct element that is no longer needed after the function
intel_pstate_calc_freq() was retired.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Refactoring code to use frequency table index instead of pstate_id.
This abstraction will make the code independent of the pstate values.
- No functional changes
- The highest frequency is at frequency table index 0 and the frequency
decreases as the index increases.
- Macros pstates_to_idx() and idx_to_pstate() can be used for conversion
between pstate_id and index.
- powernv_pstate_info now contains frequency table index to min, max and
nominal frequency (instead of pstate_ids)
- global_pstate_info new stores index values instead pstate ids.
- variables renamed as *_idx which now store index instead of pstate
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If MSR_CONFIG_TDP_CONTROL is locked, we currently try to address some
MSR 0x80000648 or so. Mask out the relevant level bits 0 and 1.
Found while running over the Jailhouse hypervisor which became upset
about this strange MSR index.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Replace MSR_NHM_TURBO_RATIO_LIMIT with MSR_TURBO_RATIO_LIMIT.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pinned timers must carry the pinned attribute in the timer structure
itself, so convert the code to the new API.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.297014487@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch migrates few users of cpufreq tables to the new helpers
that work on sorted freq-tables.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq drivers aren't required to provide a sorted frequency table
today, and even the ones which provide a sorted table aren't handled
efficiently by cpufreq core.
This patch adds infrastructure to verify if the freq-table provided by
the drivers is sorted or not, and use efficient helpers if they are
sorted.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Both callers of cpufreq_update_current_freq(), cpufreq_update_policy()
and cpufreq_start_governor(), check cpufreq_suspended before calling
that function, so drop the redundant cpufreq_suspended check from it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
CPU notifications from the firmware coming in when cpufreq is
suspended cause cpufreq_update_current_freq() to return 0 which
triggers the WARN_ON() in cpufreq_update_policy() for no reason.
Avoid that by checking cpufreq_suspended before calling
cpufreq_update_current_freq().
Fixes: c9d9c929e6 (cpufreq: Abort cpufreq_update_current_freq() for cpufreq_suspended set)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
pid_params is written once by copy_pid_params() during initialization,
and thereafter is mostly read by hot path intel_pstate_update_util().
The read of pid_params gets more after commit a4675fbc4a ("cpufreq:
intel_pstate: Replace timers with utilization update callbacks")
pstate_funcs is written once by copy_cpu_funcs() during initialization,
and thereafter is mostly read by hot path intel_pstate_update_util()
hwp_active is written to once during initialization and thereafter is
mostly read by hot path intel_pstate_update_util().
The fact that they are mostly read and not written to makes them
candidates for __read_mostly declarations.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
These functions/variables are not needed after booting, so mark them
as __init or __initdata.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
__initdata should be placed between the variable name and equal sign
(if there is) for the variable to be placed in the intended section.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If of_match_node() fails, this init function bails out without
calling of_node_put().
Also change of_node_put(of_root) to of_node_put(np); both of them
hold the same pointer, but it seems better to call of_node_put()
against the node returned by of_find_node_by_path().
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
intel_pstate_set_policy() is invoked by the cpufreq core during
driver initialization, on changes of policy attributes (minimim and
maximum frequency, for example) via sysfs and via CPU notifications
from the platform firmware. On some platforms the latter may occur
relatively often.
Commit bb6ab52f2b (intel_pstate: Do not set utilization update hook
too early) made intel_pstate_set_policy() clear the CPU's utilization
update hook before updating the policy attributes for it (and set the
hook again after doind that), but that involves invoking
synchronize_sched() and adds overhead to the CPU notifications
mentioned above and to the sched-RCU handling in general.
That extra overhead is arguably not necessary, because updating
policy attributes when the CPU's utilization update hook is active
should not lead to any adverse effects, so drop the clearing of
the hook from intel_pstate_set_policy() and make it check if
the hook has been set already when attempting to set it.
Fixes: bb6ab52f2b (intel_pstate: Do not set utilization update hook too early)
Reported-by: Jisheng Zhang <jszhang@marvell.com>
Tested-by: Jisheng Zhang <jszhang@marvell.com>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 920de6ebfa (ACPICA: Hardware: Enhance
acpi_hw_validate_register() with access_width/bit_offset awareness)
apparently exposed a latent bug, doorbell.access_width is initialized
to 64, but per Lv Zheng, it should be 4, and indeed, making that
change does bring pcc-cpufreq back to life.
Fixes: 920de6ebfa (ACPICA: Hardware: Enhance acpi_hw_validate_register() with access_width/bit_offset awareness)
Suggested-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The use of __raw IO accesors is not endian safe and should be used
sparingly. The relaxed variants should be as lightweight and also
are endian safe.
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
The maximum turbo P-State used by the intel_pstate driver may be
limited by ACPI _PSS table entry 0. After commit 9522a2ff9c
(cpufreq: intel_pstate: Enforce _PPC limits), the maximum performance
on servers will be capped by the _PSS table entry 0 by default.
Even though that is formally correct, it may lead to preformance
regressions in some cases. Namely, if the _PSS table entry 0 is
not the maximum turbo P-State, performance measured after commit
9522a2ff9c will not match the performance measured before that
commit on the same system.
For this reason, modify the code to always use the maximum turbo
frequency as the one that corresponds to _PSS table entry 0 if turbo
is enabled in the BIOS. This way, the performance levels from
before commit 9522a2ff9c will be restored on the affected systems.
Fixes: 9522a2ff9c (cpufreq: intel_pstate: Enforce _PPC limits)
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix the use of 0 instead of NULL to clk_get() call. This stops the
following warning:
drivers/cpufreq/mvebu-cpufreq.c:73:40: warning: Using plain integer as NULL pointer
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add Broxton CPU model number.
Broxton requires core_params to get performance limits via MSRs, but
it is an Atom platform, which requires more power optimized algorithm.
So the P state selection will use similar algorithm as other Atom
platforms.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The conservative governor registers a transition notifier so it
can update its internal requested_freq value if it falls out of the
policy->min...policy->max range, but requested_freq is not really
necessary.
That value is used to track the frequency requested by the governor
previously, but policy->cur can be used instead of it and then the
governor will not have to worry about updating the tracked value when
the current frequency changes independently (for example, as a result
of min or max changes).
Accodringly, drop requested_freq from struct cs_policy_dbs_info
and modify cs_dbs_timer() to use policy->cur instead of it.
While at it, notice that __cpufreq_driver_target() clamps its
target_freq argument between policy->min and policy->max, so
the callers of it don't have to do that and make additional
changes in cs_dbs_timer() in accordance with that.
After these changes the transition notifier used by the conservative
governor is not necessary any more, so drop it, which also makes it
possible to drop the struct cs_governor definition and simplify the
code accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
* pm-cpufreq-fixes:
cpufreq: intel_pstate: Fix ->set_policy() interface for no_turbo
cpufreq: intel_pstate: Fix code ordering in intel_pstate_set_policy()
* pm-cpuidle:
cpuidle: Do not access cpuidle_devices when !CONFIG_CPU_IDLE
There's no reason for gov_cancel_work() to exist at all, as it only
has one caller and the only thing done by that caller is to invoke
gov_cancel_work().
Accordingly, drop gov_cancel_work() and move its contents to the
caller.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
policy->freq_table will always be valid for this platform, otherwise
driver's probe() would fail. And so this routine will *always* return
after calling cpufreq_frequency_table_verify().
This can be done using the generic callback provided by core, lets use
it instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This routine can't fail unless the frequency table is invalid and
doesn't contain any valid entries.
Make it return the index and WARN() in case it is used for an invalid
table.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is already present as part of the policy and so no need to pass it
from the caller. Also, 'freq_table' is guaranteed to be valid in this
function and so no need to check it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The policy already has this pointer set, use it instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is absolutely no need to keep a copy to the freq-table in 'struct
od_policy_dbs_info'. Use policy->freq_table instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Most of the callers of cpufreq_frequency_get_table() already have the
pointer to a valid 'policy' structure and they don't really need to go
through the per-cpu variable first and then a check to validate the
frequency, in order to find the freq-table for the policy.
Directly use the policy->freq_table field instead for them.
Only one user of that API is left after above changes, cpu_cooling.c and
it accesses the freq_table in a racy way as the policy can get freed in
between.
Fix it by using cpufreq_cpu_get() properly.
Since there are no more users of cpufreq_frequency_get_table() left, get
rid of it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Javi Merino <javi.merino@arm.com> (cpu_cooling.c)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
These aren't required at all, remove them.
Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When turbo is disabled, the ->set_policy() interface is broken.
For example, when turbo is disabled and cpuinfo.max = 2900000 (full
max turbo frequency), setting the limits results in frequency less
than the requested one:
Set 1000000 KHz results in 0700000 KHz
Set 1500000 KHz results in 1100000 KHz
Set 2000000 KHz results in 1500000 KHz
This is because the limits->max_perf fraction is calculated using
the max turbo frequency as the reference, but when the max P-State is
capped in intel_pstate_get_min_max(), the reference is not the max
turbo P-State. This results in reducing max P-State.
One option is to always use max turbo as reference for calculating
limits. But this will not be correct. By definition the intel_pstate
sysfs limits, shows percentage of available performance. So when
BIOS has disabled turbo, the available performance is max non turbo.
So the max_perf_pct should still show 100%.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Subject & changelog, rewrite in fewer lines of code ]
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The limits->max_perf is rounded_up but immediately overwritten by
another assignment to limits->max_perf.
Move that operation to the correct location.
While here also added a pr_debug() call in ->set_policy to aid in
debugging.
Fixes: 785ee27881 (cpufreq: intel_pstate: Fix limits->max_perf rounding error)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Subject & changelog ]
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
All cpufreq drivers with a freq-table are migrated to use
cpufreq_table_validate_and_show() long back and the routine
cpufreq_frequency_table_cpuinfo() isn't used outside of cpufreq core
now.
Unexport it and update Documentation as well.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The modularity of cpufreq_stats is quite problematic.
First off, the usage of policy notifiers for the initialization
and cleanup in the cpufreq_stats module is inherently racy with
respect to CPU offline/online and the initialization and cleanup
of the cpufreq driver.
Second, fast frequency switching (used by the schedutil governor)
cannot be enabled if any transition notifiers are registered, so
if the cpufreq_stats module (that registers a transition notifier
for updating transition statistics) is loaded, the schedutil governor
cannot use fast frequency switching.
On the other hand, allowing cpufreq_stats to be built as a module
doesn't really add much value. Arguably, there's not much reason
for that code to be modular at all.
For the above reasons, make the cpufreq stats code non-modular,
modify the core to invoke functions provided by that code directly
and drop the notifiers from it.
Make the stats sysfs attributes appear empty if fast frequency
switching is enabled as the statistics will not be updated in that
case anyway (and returning -EBUSY from those attributes breaks
powertop).
While at it, clean up Kconfig help for the CPU_FREQ_STAT and
CPU_FREQ_STAT_DETAILS options.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Use clamp_val() instead of open coding it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The sequence got a bit wrong as we are sending CPUFREQ_START
notifications even before we have sent CPUFREQ_CREATE_POLICY.
Fix it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The 'initialized' field in struct cpufreq_governor is only used by
the conservative governor (as a usage counter) and the way that
happens is far from straightforward and arguably incorrect.
Namely, the value of 'initialized' is checked by
cpufreq_dbs_governor_init() and cpufreq_dbs_governor_exit() and
the results of those checks are passed (as the second argument) to
the ->init() and ->exit() callbacks in struct dbs_governor. Those
callbacks are only implemented by the ondemand and conservative
governors and ondemand doesn't use their second argument at all.
In turn, the conservative governor uses it to decide whether or not
to either register or unregister a transition notifier.
That whole mechanism is not only unnecessarily convoluted, but also
racy, because the 'initialized' field of struct cpufreq_governor is
updated in cpufreq_init_governor() and cpufreq_exit_governor() under
policy->rwsem which doesn't help if one of these functions is run
twice in parallel for different policies (which isn't impossible in
principle), for example.
Instead of it, add a proper usage counter to the conservative
governor and update it from cs_init() and cs_exit() which is
guaranteed to be non-racy, as those functions are only called
under gov_dbs_data_mutex which is global.
With that in place, drop the 'initialized' field from struct
cpufreq_governor as it is not used any more.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Create a new helper to avoid code duplication across governors.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
pr_*() helpers already prefix the print messages with
"cpufreq_governor:" and similar details aren't required in the actual
message.
For example, the print message getting fixed looks like this before this
patch:
cpufreq_governor: cpufreq: Governor initialization failed (dbs_data kobject init error 0)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
These aren't required anymore as the allocation core already prints such
messages. Remove the redundant ones.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The design of the cpufreq governor API is not very straightforward,
as struct cpufreq_governor provides only one callback to be invoked
from different code paths for different purposes. The purpose it is
invoked for is determined by its second "event" argument, causing it
to act as a "callback multiplexer" of sorts.
Unfortunately, that leads to extra complexity in governors, some of
which implement the ->governor() callback as a switch statement
that simply checks the event argument and invokes a separate function
to handle that specific event.
That extra complexity can be eliminated by replacing the all-purpose
->governor() callback with a family of callbacks to carry out specific
governor operations: initialization and exit, start and stop and policy
limits updates. That also turns out to reduce the code size too, so
do it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The return value of clamp_val() has to be stored actually.
Fixes: b7898fda5b (cpufreq: Support for fast frequency switching)
Reported-by: Steve Muckle <steve.muckle@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Downgrade pr_info to pr_debug for the "_PPC limits will be enforced"
message.
In server systems with many cores this message is annoying.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq_governor() routine is used by the cpufreq core to invoke
the current governor's ->governor() callback with appropriate arguments
and do some housekeeping related to that. Unfortunately, the way it
mixes different governor events in one code path makes it rather hard
to follow the code.
For this reason, split cpufreq_governor() into five simpler functions
that each will handle just one specific governor event and put all of
the code related to the given event into its own function.
This change is a prerequisite for a redesign of the cpufreq governor
API that will be done subsequently.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The performance and powersave cpufreq governors handle the
CPUFREQ_GOV_START event in the same way as CPUFREQ_GOV_LIMITS.
However, the cpufreq core always invokes cpufreq_governor() with the
event argument equal to CPUFREQ_GOV_LIMITS right after invoking it with
event equal to CPUFREQ_GOV_START. As a result, for both the governors
in question, __cpufreq_driver_target() is executed twice in a row
with the same arguments which is not useful.
For this reason, simplify the performance and powersave governors
to handle the CPUFREQ_GOV_LIMITS event only as that's going to be
sufficient for the governor start too.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
It is not necessary to check the governor's max_transition_latency
attribute every time cpufreq_governor() runs, so check it only if
the event argument is CPUFREQ_GOV_POLICY_INIT.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
None of the cpufreq governors currently in the tree will ever fail
an invocation of the ->governor() callback with the event argument
equal to CPUFREQ_GOV_LIMITS (unless invoked with incorrect arguments
which doesn't matter anyway) and had it ever failed, the result of
it wouldn't have been very clean.
For this reason, rearrange the code in the core to ignore the return
value of cpufreq_governor() when called with event equal to
CPUFREQ_GOV_LIMITS.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.
However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.
This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.
Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.
I was using this definition for testing:
#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.
I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.
[ Updated the 9p parts as per Al Viro - Linus ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull thermal management updates from Zhang Rui:
- Introduce generic ADC thermal driver, based on OF thermal (Laxman
Dewangan)
- Introduce new thermal driver for Tango chips (Marc Gonzalez)
- Rockchip driver support for RK3399, RK3366, and some fixes (Caesar
Wang, Elaine Zhang and Shawn Lin)
- Add CPU power cooling model to Mediatek thermal driver (Dawei Chien)
- Wider usage of dev_thermal_zone_of_sensor_register (Eduardo Valentin)
- TI thermal driver gained a new maintainer (Keerthy).
- Enabled powerclamp driver by checking CPU feature and package cstate
counter instead of CPU whitelist (Jacob Pan)
- Various fixes on thermal governor, OF thermal, Tegra, and RCAR
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (50 commits)
thermal: tango: initialize TEMPSI_CFG
thermal: rockchip: use the usleep_range instead of udelay
thermal: rockchip: add the notes for better reading
thermal: rockchip: Support RK3366 SoCs in the thermal driver
thermal: rockchip: handle the power sequence for tsadc controller
thermal: rockchip: update the tsadc table for rk3399
thermal: rockchip: fixes the code_to_temp for tsadc driver
thermal: rockchip: disable thermal->clk in err case
thermal: tegra: add Tegra132 specific SOC_THERM driver
thermal: fix ptr_ret.cocci warnings
thermal: mediatek: Add cpu dynamic power cooling model.
thermal: generic-adc: Add ADC based thermal sensor driver
thermal: generic-adc: Add DT binding for ADC based thermal sensor
thermal: tegra: fix static checker warning
thermal: tegra: mark PM functions __maybe_unused
thermal: add temperature sensor support for tango SoC
thermal: hisilicon: fix IRQ imbalance enabling
thermal: hisilicon: support to use any sensor
thermal: rcar: Remove binding docs for r8a7794
thermal: tegra: add PM support
...
- Stable-candidate cpuidle fix to make it check the right variable
when deciding whether or not to enable interrupts on the local CPU
so as to avoid enabling iterrupts too early in some cases if the
system has both coupled and per-core idle states (Daniel Lezcano).
- Stable-candidate PM core fix to make it handle failures at the
"late suspend" stage of device suspend consistently for all
devices regardless of whether or not async suspend/resume is
enabled for them (Rafael Wysocki).
- Cleanups in the cpufreq core, the schedutil governor and the
intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXRglcAAoJEILEb/54YlRxCwEQAKQNLO5wrNnOQmVrMHaOk1XH
4zT+AdggLRVgBld4d4Vcli2zJPZfpnZuzjJdwTB5zLgQ4WIcBb6meOfH87XGqRyJ
o4ksyEUhpvDk8AdlmTA5CvvjFuydPJG5ZUSiM035XRT9heebvhgyaMBnT3ucXbq9
7LhNhCQ+a8arndt9ePO7tZnFfQQUbwNJ2BDVuH5DJBqMIFOo2/Kpag43CdFWlWZT
jnWaleDCjSmanuJ/45bFJHJeSZ7PK2etnArfzKtb9QLSGnuEfFPdHuUzJYo5dkP7
UBeYA94hhfR3f5FJIqNlF3N+eLEX1idpwxc8+CJLLDKDd1ZCBrbLoz5fwM+fVn0h
AfmyR+J1czcbiphsmpViOYDRrKdiQVkbP6SpBswvgMCZAcNDF2bxhzOlcuTUc+u0
8xsjWOtArL6uvzsAHa1HY6hhgUn9FB8m20HX+DmS2/zzqyzoRefenoyVcuLsAhXC
fm+sARQ7tvy3OoGRQ9mloWgv2X5iQUY5IVjOG2amIhbUvVmKQutPjVTGTwHqmmcb
2nNYptLsTA6crvnexPcPHY+OFjkQl/omtfaMx+OJl63yhln5ibveGOfZ6F8sPdoB
bRqHuHoK/xh9hSNwj117ZFzq1nm54mLjh0Yhw3EXFcV4I9vdsTp8/WeNThGvT17j
M+6PDXyjlwh3HZpGm+HW
=3vtL
-----END PGP SIGNATURE-----
Merge tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are two stable-candidate fixes (PM core, cpuidle) and a bunch of
cpufreq cleanups.
Specifics:
- Stable-candidate cpuidle fix to make it check the right variable
when deciding whether or not to enable interrupts on the local CPU
so as to avoid enabling iterrupts too early in some cases if the
system has both coupled and per-core idle states (Daniel Lezcano).
- Stable-candidate PM core fix to make it handle failures at the
"late suspend" stage of device suspend consistently for all devices
regardless of whether or not async suspend/resume is enabled for
them (Rafael Wysocki).
- Cleanups in the cpufreq core, the schedutil governor and the
intel_pstate driver (Rafael Wysocki, Pankaj Gupta, Viresh Kumar)"
* tag 'pm-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: Handle failures in device_suspend_late() consistently
cpufreq: schedutil: Improve prints messages with pr_fmt
cpuidle: Fix cpuidle_state_is_coupled() argument in cpuidle_enter()
cpufreq: simplified goto out in cpufreq_register_driver()
cpufreq: governor: CPUFREQ_GOV_STOP never fails
cpufreq: governor: CPUFREQ_GOV_POLICY_EXIT never fails
intel_pstate: Simplify conditional in intel_pstate_set_policy()
Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
- Check periodically the coherent platform function's state from Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
workaround, and a kconfig dependency fix."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
/AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
/ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
udQES+t3F/9gvtxgxtDe
=YvyQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
Bauermann, Valentin Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan
Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
from Guilherme G Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme
G Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica
Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe
Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic
Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian
Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs
from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled
from Ian Munsie
- Check periodically the coherent platform function's state from
Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
an erratum workaround, and a kconfig dependency fix."
* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
powerpc/86xx: Fix PCI interrupt map definition
powerpc/86xx: Move pci1 definition to the include file
powerpc/fsl: Fix build of the dtb embedded kernel images
powerpc/fsl: Fix rcpm compatible string
powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
powerpc/fsl-pci: Add a workaround for PCI 5 errata
powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
powerpc/powernv/npu: Add PE to PHB's list
powerpc/powernv: Fix insufficient memory allocation
powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
powerpc/powernv/npu: Enable NVLink pass through
powerpc/powernv/npu: Rework TCE Kill handling
powerpc/powernv/npu: Add set/unset window helpers
powerpc/powernv/ioda2: Export debug helper pe_level_printk()
...
Pull MIPS updates from Ralf Baechle:
"This is the main pull request for MIPS for 4.7. Here's the summary of
the changes:
- ATH79: Support for DTB passuing using the UHI boot protocol
- ATH79: Remove support for builtin DTB.
- ATH79: Add zboot debug serial support.
- ATH79: Add initial support for Dragino MS14 (Dragine 2), Onion Omega
and DPT-Module.
- ATH79: Update devicetree clock support for AR9132 and AR9331.
- ATH79: Cleanup the DT code.
- ATH79: Support newer SOCs in ath79_ddr_ctrl_init.
- ATH79: Fix regression in PCI window initialization.
- BCM47xx: Move SPROM driver to drivers/firmware/
- BCM63xx: Enable partition parser in defconfig.
- BMIPS: BMIPS5000 has I cache filing from D cache
- BMIPS: BMIPS: Add cpu-feature-overrides.h
- BMIPS: Add Whirlwind support
- BMIPS: Adjust mips-hpt-frequency for BCM7435
- BMIPS: Remove maxcpus from BCM97435SVMB DTS
- BMIPS: Add missing 7038 L1 register cells to BCM7435
- BMIPS: Various tweaks to initialization code.
- BMIPS: Enable partition parser in defconfig.
- BMIPS: Cache tweaks.
- BMIPS: Add UART, I2C and SATA devices to DT.
- BMIPS: Add BCM6358 and BCM63268support
- BMIPS: Add device tree example for BCM6358.
- BMIPS: Improve Improve BCM6328 and BCM6368 device trees
- Lantiq: Add support for device tree file from boot loader
- Lantiq: Allow build with no built-in DT.
- Loongson 3: Reserve 32MB for RS780E integrated GPU.
- Loongson 3: Fix build error after ld-version.sh modification
- Loongson 3: Move chipset ACPI code from drivers to arch.
- Loongson 3: Speedup irq processing.
- Loongson 3: Add basic Loongson 3A support.
- Loongson 3: Set cache flush handlers to nop.
- Loongson 3: Invalidate special TLBs when needed.
- Loongson 3: Fast TLB refill handler.
- MT7620: Fallback strategy for invalid syscfg0.
- Netlogic: Fix CP0_EBASE redefinition warnings
- Octeon: Initialization fixes
- Octeon: Add DTS files for the D-Link DSR-1000N and EdgeRouter Lite
- Octeon: Enable add Octeon-drivers in cavium_octeon_defconfig
- Octeon: Correctly handle endian-swapped initramfs images.
- Octeon: Support CN73xx, CN75xx and CN78xx.
- Octeon: Remove dead code from cvmx-sysinfo.
- Octeon: Extend number of supported CPUs past 32.
- Octeon: Remove some code limiting NR_IRQS to 255.
- Octeon: Simplify octeon_irq_ciu_gpio_set_type.
- Octeon: Mark some functions __init in smp.c
- Octeon: Octeon: Add Octeon III CN7xxx interface detection
- PIC32: Add serial driver and bindings for it.
- PIC32: Add PIC32 deadman timer driver and bindings.
- PIC32: Add PIC32 clock timer driver and bindings.
- Pistachio: Determine SoC revision during boot
- Sibyte: Fix Kconfig dependencies of SIBYTE_BUS_WATCHER.
- Sibyte: Strip redundant comments from bcm1480_regs.h.
- Panic immediately if panic_on_oops is set.
- module: fix incorrect IS_ERR_VALUE macro usage.
- module: Make consistent use of pr_*
- Remove no longer needed work_on_cpu() call.
- Remove CONFIG_IPV6_PRIVACY from defconfigs.
- Fix registers of non-crashing CPUs in dumps.
- Handle MIPSisms in new vmcore_elf32_check_arch.
- Select CONFIG_HANDLE_DOMAIN_IRQ and make it work.
- Allow RIXI to be used on non-R2 or R6 cores.
- Reserve nosave data for hibernation
- Fix siginfo.h to use strict POSIX types.
- Don't unwind user mode with EVA.
- Fix watchpoint restoration
- Ptrace watchpoints for R6.
- Sync icache when it fills from dcache
- I6400 I-cache fills from dcache.
- Various MSA fixes.
- Cleanup MIPS_CPU_* definitions.
- Signal: Move generic copy_siginfo to signal.h
- Signal: Fix uapi include in exported asm/siginfo.h
- Timer fixes for sake of KVM.
- XPA TLB refill fixes.
- Treat perf counter feature
- Update John Crispin's email address
- Add PIC32 watchdog and bindings.
- Handle R10000 LL/SC bug in set_pte()
- cpufreq: Various fixes for Longson1.
- R6: Fix R2 emulation.
- mathemu: Cosmetic fix to ADDIUPC emulation, plenty of other small fixes
- ELF: ABI and FP fixes.
- Allow for relocatable kernel and use that to support KASLR.
- Fix CPC_BASE_ADDR mask
- Plenty fo smp-cps, CM, R6 and M6250 fixes.
- Make reset_control_ops const.
- Fix kernel command line handling of leading whitespace.
- Cleanups to cache handling.
- Add brcm, bcm6345-l1-intc device tree bindings.
- Use generic clkdev.h header
- Remove CLK_IS_ROOT usage.
- Misc small cleanups.
- CM: Fix compilation error when !MIPS_CM
- oprofile: Fix a preemption issue
- Detect DSP ASE v3 support:1"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (275 commits)
MIPS: pic32mzda: fix getting timer clock rate.
MIPS: ath79: fix regression in PCI window initialization
MIPS: ath79: make ath79_ddr_ctrl_init() compatible for newer SoCs
MIPS: Fix VZ probe gas errors with binutils <2.24
MIPS: perf: Fix I6400 event numbers
MIPS: DEC: Export `ioasic_ssr_lock' to modules
MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC
MIPS: CM: Fix compilation error when !MIPS_CM
MIPS: Fix genvdso error on rebuild
USB: ohci-jz4740: Remove obsolete driver
MIPS: JZ4740: Probe OHCI platform device via DT
MIPS: JZ4740: Qi LB60: Remove support for AVT2 variant
MIPS: pistachio: Determine SoC revision during boot
MIPS: BMIPS: Adjust mips-hpt-frequency for BCM7435
mips: mt7620: fallback to SDRAM when syscfg0 does not have a valid value for the memory type
MIPS: Prevent "restoration" of MSA context in non-MSA kernels
MIPS: cevt-r4k: Dynamically calculate min_delta_ns
MIPS: malta-time: Take seconds into account
MIPS: malta-time: Start GIC count before syncing to RTC
MIPS: Force CPUs to lose FP context during mode switches
...
None of the cpufreq governors currently in the tree will ever fail
an invocation of the ->governor() callback with the event argument
equal to CPUFREQ_GOV_STOP (unless invoked with incorrect arguments
which doesn't matter anyway) and it is rather difficult to imagine
a valid reason for such a failure.
Accordingly, rearrange the code in the core to make it clear that
this call never fails.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
None of the cpufreq governors currently in the tree will ever fail
an invocation of the ->governor() callback with the event argument
equal to CPUFREQ_GOV_POLICY_EXIT (unless invoked with incorrect
arguments which doesn't matter anyway) and it wouldn't really
make sense to fail it, because the caller won't be able to handle
that failure in a meaningful way.
Accordingly, rearrange the code in the core to make it clear that
this call never fails.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
One of the if () statements in intel_pstate_set_policy() causes
another if () to be evaluated if the condition is true and it
doesn't do anything else, so merge the two if () statements into
one.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
- New cpufreq "schedutil" governor (making decisions based on CPU
utilization information provided by the scheduler and capable of
switching CPU frequencies right away if the underlying driver
supports that) and support for fast frequency switching in the
acpi-cpufreq driver (Rafael Wysocki).
- Consolidation of CPU frequency management on ARM platforms allowing
them to get rid of some platform-specific boilerplate code if they
are going to use the cpufreq-dt driver (Viresh Kumar, Finley Xiao,
Marc Gonzalez).
- Support for ACPI _PPC and CPU frequency limits in the intel_pstate
driver (Srinivas Pandruvada).
- Fixes and cleanups in the cpufreq core and generic governor code
(Rafael Wysocki, Sai Gurrappadi).
- intel_pstate driver optimizations and cleanups (Rafael Wysocki,
Philippe Longepe, Chen Yu, Joe Perches).
- cpufreq powernv driver fixes and cleanups (Akshay Adiga, Shilpasri
Bhat).
- cpufreq qoriq driver fixes and cleanups (Jia Hongtao).
- ACPI cpufreq driver cleanups (Viresh Kumar).
- Assorted cpufreq driver updates (Ashwin Chaugule, Geliang Tang,
Javier Martinez Canillas, Paul Gortmaker, Sudeep Holla).
- Assorted cpufreq fixes and cleanups (Joe Perches, Arnd Bergmann).
- Fixes and cleanups in the OPP (Operating Performance Points)
framework, mostly related to OPP sharing, and reorganization of
OF-dependent code in it (Viresh Kumar, Arnd Bergmann, Sudeep Holla).
- New "passive" governor for devfreq (for SoC subsystems that will
rely on someone else for the management of their power resources)
and consolidation of devfreq support for Exynos platforms, coding
style and typo fixes for devfreq (Chanwoo Choi, MyungJoo Ham).
- PM core fixes and cleanups, mostly to make it work better with the
generic power domains (genpd) framework, and updates for that
framework (Ulf Hansson, Thierry Reding, Colin Ian King).
- Intel Broxton support for the intel_idle driver (Len Brown).
- cpuidle core optimization and fix (Daniel Lezcano, Dave Gerlach).
- ARM cpuidle cleanups (Jisheng Zhang).
- Intel Kabylake support for the RAPL power capping driver (Jacob Pan).
- AVS (Adaptive Voltage Switching) rockchip-io driver update (Heiko
Stuebner).
- Updates for the cpupower tool (Arjun Sreedharan, Colin Ian King,
Mattia Dongili, Thomas Renninger).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXOjLgAAoJEILEb/54YlRxfn0P/RbSPpNlUNBIE8DFrdD9jRdJ
TIpZ7uiHi9tU1ZF17UBbb/SwuWfYVnVmiorZGRfFOtGaoqh0HFZ/nplDz99rK0ku
vW2OnbojMQEUMU3IcUT1y4BsSl0H23f7ZOKrdprALeWxDQmbgnYjrE6vkX6hRtld
A8eeZvIEJ5CzV8S+9aOOOpojW2yXk5dYGdZ7gpQdoM0n7zVLyPnNucJoha3BYmOG
FwKEIe05RpIhfLfGT0CXIRcOzwAZ6ZWKgOrXUrx/AadPbvu/TP9zkI0djYI8ukyv
z2oiO/GExoeGVuUzvy8vY5SiH4NQvViftFzMZepcsmjxmVglohMPRL8VLjZIBckk
DDcqH9e0OQI20jjYT1vIf5+JWBvLxuQfGtyzI0S+sE/elB1zI/3O8p+8N2CuF5n+
my2dawIewnHI/0AdSpJ+K7DVrfwPHAX19axtPX3dJSLh2OuHCPNlAtbxRGAriBfH
Zv9NETxlrch69o2AD4K54DErWV1FsYLznzK5Zms6MC2Ispbb+oiYpacTlZblznvb
H5U2SSNlA5Niir3vVJ01nKRtzxlWoi67CQxbYrGhlaR0nTTxf9HqWgcSiTZrn7Pv
hs+LA2aUfMf3JGjStdORS7S8biQSid5vypfkglpWLZBKHNC9BqqZd9gSM+jF3FVh
ps4mMM4UXY4hnoFDkMBI
=WM89
-----END PGP SIGNATURE-----
Merge tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The majority of changes go into the cpufreq subsystem this time.
To me, quite obviously, the biggest ticket item is the new "schedutil"
governor. Interestingly enough, it's the first new cpufreq governor
since the beginning of the git era (except for some out-of-the-tree
ones).
There are two main differences between it and the existing governors.
First, it uses the information provided by the scheduler directly for
making its decisions, so it doesn't have to track anything by itself.
Second, it can invoke drivers (supporting that feature) to adjust CPU
performance right away without having to spawn work items to be
executed in process context or similar. Currently, the acpi-cpufreq
driver is the only one supporting that mode of operation, but then it
is used on a large number of systems.
The "schedutil" governor as included here is very simple and mostly
regarded as a foundation for future work on the integration of the
scheduler with CPU power management (in fact, there is work in
progress on top of it already). Nevertheless it works and the
preliminary results obtained with it are encouraging.
There also is some consolidation of CPU frequency management for ARM
platforms that can add their machine IDs the the new stub dt-platdev
driver now and that will take care of creating the requisite platform
device for cpufreq-dt, so it is not necessary to do that in platform
code any more. Several ARM platforms are switched over to using this
generic mechanism.
In addition to that, the intel_pstate driver is now going to respect
CPU frequency limits set by the platform firmware (or a BMC) and
provided via the ACPI _PPC object.
The devfreq subsystem is getting a new "passive" governor for SoCs
subsystems that will depend on somebody else to manage their voltage
rails and its support for Samsung Exynos SoCs is consolidated.
The rest is support for new hardware (Intel Broxton support in
intel_idle for one example), bug fixes, optimizations and cleanups in
a number of places.
Specifics:
- New cpufreq "schedutil" governor (making decisions based on CPU
utilization information provided by the scheduler and capable of
switching CPU frequencies right away if the underlying driver
supports that) and support for fast frequency switching in the
acpi-cpufreq driver (Rafael Wysocki)
- Consolidation of CPU frequency management on ARM platforms allowing
them to get rid of some platform-specific boilerplate code if they
are going to use the cpufreq-dt driver (Viresh Kumar, Finley Xiao,
Marc Gonzalez)
- Support for ACPI _PPC and CPU frequency limits in the intel_pstate
driver (Srinivas Pandruvada)
- Fixes and cleanups in the cpufreq core and generic governor code
(Rafael Wysocki, Sai Gurrappadi)
- intel_pstate driver optimizations and cleanups (Rafael Wysocki,
Philippe Longepe, Chen Yu, Joe Perches)
- cpufreq powernv driver fixes and cleanups (Akshay Adiga, Shilpasri
Bhat)
- cpufreq qoriq driver fixes and cleanups (Jia Hongtao)
- ACPI cpufreq driver cleanups (Viresh Kumar)
- Assorted cpufreq driver updates (Ashwin Chaugule, Geliang Tang,
Javier Martinez Canillas, Paul Gortmaker, Sudeep Holla)
- Assorted cpufreq fixes and cleanups (Joe Perches, Arnd Bergmann)
- Fixes and cleanups in the OPP (Operating Performance Points)
framework, mostly related to OPP sharing, and reorganization of
OF-dependent code in it (Viresh Kumar, Arnd Bergmann, Sudeep Holla)
- New "passive" governor for devfreq (for SoC subsystems that will
rely on someone else for the management of their power resources)
and consolidation of devfreq support for Exynos platforms, coding
style and typo fixes for devfreq (Chanwoo Choi, MyungJoo Ham)
- PM core fixes and cleanups, mostly to make it work better with the
generic power domains (genpd) framework, and updates for that
framework (Ulf Hansson, Thierry Reding, Colin Ian King)
- Intel Broxton support for the intel_idle driver (Len Brown)
- cpuidle core optimization and fix (Daniel Lezcano, Dave Gerlach)
- ARM cpuidle cleanups (Jisheng Zhang)
- Intel Kabylake support for the RAPL power capping driver (Jacob
Pan)
- AVS (Adaptive Voltage Switching) rockchip-io driver update (Heiko
Stuebner)
- Updates for the cpupower tool (Arjun Sreedharan, Colin Ian King,
Mattia Dongili, Thomas Renninger)"
* tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (112 commits)
intel_pstate: Clean up get_target_pstate_use_performance()
intel_pstate: Use sample.core_avg_perf in get_avg_pstate()
intel_pstate: Clarify average performance computation
intel_pstate: Avoid unnecessary synchronize_sched() during initialization
cpufreq: schedutil: Make default depend on CONFIG_SMP
cpufreq: powernv: del_timer_sync when global and local pstate are equal
cpufreq: powernv: Move smp_call_function_any() out of irq safe block
intel_pstate: Clean up intel_pstate_get()
cpufreq: schedutil: Make it depend on CONFIG_SMP
cpufreq: governor: Fix handling of special cases in dbs_update()
PM / OPP: Move CONFIG_OF dependent code in a separate file
cpufreq: intel_pstate: Ignore _PPC processing under HWP
cpufreq: arm_big_little: use generic OPP functions for {init, free}_opp_table
PM / OPP: add non-OF versions of dev_pm_opp_{cpumask_, }remove_table
cpufreq: tango: Use generic platdev driver
PM / OPP: pass cpumask by reference
cpufreq: Fix GOV_LIMITS handling for the userspace governor
cpupower: fix potential memory leak
PM / devfreq: style/typo fixes
PM / devfreq: exynos: Add the detailed correlation for Exynos5422 bus
..
Pull x86 asm updates from Ingo Molnar:
"The main changes in this cycle were:
- MSR access API fixes and enhancements (Andy Lutomirski)
- early exception handling improvements (Andy Lutomirski)
- user-space FS/GS prctl usage fixes and improvements (Andy
Lutomirski)
- Remove the cpu_has_*() APIs and replace them with equivalents
(Borislav Petkov)
- task switch micro-optimization (Brian Gerst)
- 32-bit entry code simplification (Denys Vlasenko)
- enhance PAT handling in enumated CPUs (Toshi Kani)
... and lots of other cleanups/fixlets"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
x86/arch_prctl/64: Restore accidentally removed put_cpu() in ARCH_SET_GS
x86/entry/32: Remove asmlinkage_protect()
x86/entry/32: Remove GET_THREAD_INFO() from entry code
x86/entry, sched/x86: Don't save/restore EFLAGS on task switch
x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs
selftests/x86/ldt_gdt: Test set_thread_area() deletion of an active segment
x86/tls: Synchronize segment registers in set_thread_area()
x86/asm/64: Rename thread_struct's fs and gs to fsbase and gsbase
x86/arch_prctl/64: Remove FSBASE/GSBASE < 4G optimization
x86/segments/64: When load_gs_index fails, clear the base
x86/segments/64: When loadsegment(fs, ...) fails, clear the base
x86/asm: Make asm/alternative.h safe from assembly
x86/asm: Stop depending on ptrace.h in alternative.h
x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()
x86/asm: Make sure verify_cpu() has a good stack
x86/extable: Add a comment about early exception handlers
x86/msr: Set the return value to zero when native_rdmsr_safe() fails
x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
x86/paravirt: Add paravirt_{read,write}_msr()
x86/msr: Carry on after a non-"safe" MSR access fails
...
The comments and the core_busy variable name in
get_target_pstate_use_performance() are totally confusing,
so modify them to reflect what's going on.
The results of the computations should be the same as before.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notice that get_avg_pstate() can use sample.core_avg_perf instead of
carrying the same division again, so make it do that.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The core_pct_busy field of struct sample actually contains the
average performace during the last sampling period (in percent)
and not the utilization of the core as suggested by its name
which is confusing.
For this reason, change the name of that field to core_avg_perf
and rename the function that computes its value accordingly.
Also notice that storing this value as percentage requires a costly
integer multiplication to be carried out in a hot path, so instead
store it as an "extended fixed point" value with more fraction bits
and update the code using it accordingly (it is better to change the
name of the field along with its meaning in one go than to make those
two changes separately, as that would likely lead to more
confusion).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, in intel_pstate_clear_update_util_hook(), after
clearing the utilization update hook, we leverage
synchronize_sched() to deal with synchronization, which
is a little bit time-costly because synchronize_sched()
has to wait for all the CPUs to go through a grace period.
Actually, the synchronize_sched() is not necessary if the utilization
update hook has not been set for the given CPU yet, so make the driver
check if that's the case and avoid the synchronize_sched() call then.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=116371
Tested-by: Tian Ye <yex.tian@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw : Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
CPU_FREQ_GOV_SCHEDUTIL gained a dependency on SMP, so now we
get a warning if it gets selected by CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
without SMP:
warning: (CPU_FREQ_DEFAULT_GOV_SCHEDUTIL) selects CPU_FREQ_GOV_SCHEDUTIL which has unmet direct dependencies (CPU_FREQ && SMP)
This adds another dependency to avoid the problem.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: bf7cdff194 (cpufreq: schedutil: Make it depend on CONFIG_SMP)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When global and local pstate are equal in a powernv_target_index() call,
we don't queue a timer. But we may have timer already queued for future.
This could cause the timer to fire one additional time for no use.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix a WARN_ON caused by smp_call_function_any() when irq is disabled,
because of changes made in the patch ('cpufreq: powernv: Ramp-down
global pstate slower than local-pstate')
https://patchwork.ozlabs.org/patch/612058/
WARNING: CPU: 0 PID: 4 at kernel/smp.c:291
smp_call_function_single+0x170/0x180
Call Trace:
[c0000007f648f9f0] [c0000007f648fa90] 0xc0000007f648fa90 (unreliable)
[c0000007f648fa30] [c0000000001430e0] smp_call_function_any+0x170/0x1c0
[c0000007f648fa90] [c0000000007b4b00]
powernv_cpufreq_target_index+0xe0/0x250
[c0000007f648fb00] [c0000000007ac9dc]
__cpufreq_driver_target+0x20c/0x3d0
[c0000007f648fbc0] [c0000000007b1b4c] od_dbs_timer+0xcc/0x260
[c0000007f648fc10] [c0000000007b3024] dbs_work_handler+0x54/0xa0
[c0000007f648fc50] [c0000000000c49a8] process_one_work+0x1d8/0x590
[c0000007f648fce0] [c0000000000c4e08] worker_thread+0xa8/0x660
[c0000007f648fd80] [c0000000000cca88] kthread+0x108/0x130
[c0000007f648fe30] [c0000000000095e8] ret_from_kernel_thread+0x5c/0x74
- Calling smp_call_function_any() with interrupt disabled (through
spin_lock_irqsave) could cause a deadlock, as smp_call_function_any()
relies on the IPI to complete. This is detected in the
smp_call_function_any() call and hence the WARN_ON.
- As the spinlock (gpstates->lock) is only used to synchronize access of
global_pstate_info between timer irq handler and target_index calls. And
the timer irq handler just try_locks() hence it would not cause a
deadlock. Hence could do without making spinlocks irq safe.
- As the smp_call_function_any() is a blocking call and does not access
global_pstates_info, it could reduce the critcal section by moving
smp_call_function_any() after giving up the lock.
Reported-by: Abdul Haleem <abdhalee@linux.vnet.linux.com>
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
intel_pstate_get() contains a local variable that's initialized but
never used and it can be written in fewer lines of code, so clean
it up.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Make the schedutil cpufreq governor depend on CONFIG_SMP, because
the scheduler-provided utilization numbers used by it are only
available with CONFIG_SMP set.
Fixes: 9bdcb44e39 (cpufreq: schedutil: New governor based on scheduler utilization data)
Reported-by: Steve Muckle <steve.muckle@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq-fixes:
intel_pstate: Fix intel_pstate_get()
cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
cpufreq: st: enable selective initialization based on the platform
cpufreq: intel_pstate: Fix processing for turbo activation ratio
As reported in KBZ 69821:
"With CONFIG_HZ_PERIODIC=y cpu stays at the lowest frequcency 800MHz
even if usage goes to 100%, frequency does not scale up, the governor
in use is ondemand. Neither works conservative. Performance and
userspace governors work as expected.
With CONFIG_NO_HZ_IDLE or CONFIG_NO_HZ_FULL cpu scales up with ondemand
as expected."
Analysis carried out by Chen Yu leads to the conclusion that the
observed issue is due to idle_time in dbs_update() representing a
negative number in which case the function will return 0 as the load
(unless load is greater than 0 for another CPU sharing the policy),
although that need not be the right choice.
Indeed, idle_time representing a negative number means that during
the last sampling interval the CPU was almost 100% busy on the rough
average, so 100 should be returned as the load in that case.
Modify the code accordingly and rearrange it to clarify the handling
of all of the special cases in it. While at it, also avoid returning
zero as the load if time_elapsed is 0 (it doesn't really make sense
to return 0 then).
Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Timo Valtoaho <timo.valtoaho@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
* pm-opp-fixes:
PM / OPP: Remove useless check
* pm-cpufreq-fixes:
intel_pstate: Fix intel_pstate_get()
cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
cpufreq: st: enable selective initialization based on the platform
* pm-cpuidle-fixes:
ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
When HWP (hardware P states) feature is active, the ACPI _PSS and _PPC
is not used. So ignore processing for _PPC limits.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently when performing random CPU hot-plugs and suspend-to-ram(S2R)
on systems using arm_big_little cpufreq driver, we get warnings similar
to something like below:
cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 600000000,
volt: 800000, enabled: 1. New: freq: 600000000, volt: 800000, enabled: 1
This is mainly because the OPPs for the shared cpus are not set. We can
just use dev_pm_opp_of_cpumask_add_table in case the OPPs are obtained
from DT(arm_big_little_dt.c) or use dev_pm_opp_set_sharing_cpus if the
OPPs are obtained by other means like firmware(e.g. scpi-cpufreq.c)
Also now that the generic dev_pm_opp{,_of}_cpumask_remove_table can
handle removal of opp table and entries for all associated CPUs, we can
re-use dev_pm_opp{,_of}_cpumask_remove_table as free_opp_table in
cpufreq_arm_bL_ops.
This patch makes necessary changes to reuse the generic OPP functions for
{init,free}_opp_table and thereby eliminating the warnings.
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add tango4 compatible string to the list.
Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, the userspace governor only updates frequency on GOV_LIMITS
if policy->cur falls outside policy->{min/max}. However, it is also
necessary to update current frequency on GOV_LIMITS to match the user
requested value if it can be achieved within the new policy->{max/min}.
This was previously the behaviour in the governor until commit d1922f0
("cpufreq: Simplify userspace governor") which incorrectly assumed that
policy->cur == user requested frequency via scaling_setspeed. This won't
be true if the user requested frequency falls outside policy->{min/max}.
Ex: a temporary thermal cap throttled the user requested frequency.
Fix this by storing the user requested frequency in a seperate variable.
The governor will then try to achieve this request on every GOV_LIMITS
change.
Fixes: d1922f0256 (cpufreq: Simplify userspace governor)
Signed-off-by: Sai Gurrappadi <sgurrappadi@nvidia.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit 8fa520af50 "intel_pstate: Remove freq calculation from
intel_pstate_calc_busy()" intel_pstate_get() calls get_avg_frequency()
to compute the average frequency, which is problematic for two reasons.
First, intel_pstate_get() may be invoked before the driver reads the
CPU feedback registers for the first time and if that happens,
get_avg_frequency() will attempt to divide by zero.
Second, the get_avg_frequency() call in intel_pstate_get() is racy
with respect to intel_pstate_sample() and it may end up returning
completely meaningless values for this reason.
Moreover, after commit 7349ec0470 "intel_pstate: Move
intel_pstate_calc_busy() into get_target_pstate_use_performance()"
sample.core_pct_busy is never computed on Atom, but it is used in
intel_pstate_adjust_busy_pstate() in that case too.
To address those problems notice that if sample.core_pct_busy
was used in the average frequency computation carried out by
get_avg_frequency(), both the divide by zero problem and the
race with respect to intel_pstate_sample() would be avoided.
Accordingly, move the invocation of intel_pstate_calc_busy() from
get_target_pstate_use_performance() to intel_pstate_update_util(),
which also will take care of the uninitialized sample.core_pct_busy
on Atom, and modify get_avg_frequency() to use sample.core_pct_busy
as per the above.
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Link: http://marc.info/?l=linux-kernel&m=146226437623173&w=4
Fixes: 8fa520af50 "intel_pstate: Remove freq calculation from intel_pstate_calc_busy()"
Fixes: 7349ec0470 "intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()"
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 41cfd64cf4 "Update frequencies of policy->cpus only from
->set_policy()" changed the way the intel_pstate driver's ->set_policy
callback updates the HWP (hardware-managed P-states) settings.
A side effect of it is that if those settings are modified on the
boot CPU during system suspend and wakeup, they will never be
restored during subsequent system resume.
To address this problem, allow cpufreq drivers that don't provide
->target or ->target_index callbacks to use ->suspend and ->resume
callbacks and add a ->resume callback to intel_pstate to restore
the HWP settings on the CPUs that belong to the given policy.
Fixes: 41cfd64cf4 "Update frequencies of policy->cpus only from ->set_policy()"
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
How we switch MMU context differs between hash and radix. For hash we
need to switch the SLB details and for radix we need to switch the PID
SPR.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* pm-cpufreq-fixes:
cpufreq: intel_pstate: Fix processing for turbo activation ratio
Revert "cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC"
The sti-cpufreq does unconditional registration of the cpufreq-dt driver
which causes issue on an multi-platform build. For example, on Vexpress
TC2 platform, we get the following error on boot:
cpu cpu0: OPP-v2 not supported
cpu cpu0: Not doing voltage scaling
cpu: dev_pm_opp_of_cpumask_add_table: couldn't find opp table
for cpu:0, -19
cpu cpu0: dev_pm_opp_get_max_volt_latency: Invalid regulator (-6)
...
arm_big_little: bL_cpufreq_register: Failed registering platform driver:
vexpress-spc, err: -17
The actual driver fails to initialise as cpufreq-dt is probed
successfully, which is incorrect. This issue can happen to any platform
not using cpufreq-dt in a multi-platform build.
This patch adds a check to do selective initialization of the driver.
Fixes: ab0ea257fc (cpufreq: st: Provide runtime initialised driver for ST's platforms)
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Lee Jones <lee.jones@linaro.org>
Cc: 4.5+ <stable@vger.kernel.org> # 4.5+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move cpufreq bits for mvebu into drivers/cpufreq/ directory, that's
where they really belong to.
Compiled tested only.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are no more users of platform-data for cpufreq-dt driver, get rid
of it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Existing platforms, which do not support operating-points-v2, can
explicitly tell the opp core that some of the CPUs share opp tables,
with help of dev_pm_opp_set_sharing_cpus().
For such platforms, explicitly ask the opp core to provide list of CPUs
sharing the opp table with current cpu device, before falling back to
platform data.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The name of the prev_cpu_wall field in struct cpu_dbs_info is
confusing, because it doesn't represent wall time, but the previous
update time as returned by get_cpu_idle_time() (that may be the
current value of jiffies_64 in some cases, for example).
Moreover, the names of some related variables in dbs_update() take
that confusion further.
Rename all of those things to make their names reflect the purpose
more accurately. While at it, drop unnecessary parens from one of
the updated expressions.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Chen Yu <yu.c.chen@intel.com>
For platforms which are controlled via remove node manager, enable _PPC by
default. These platforms are mostly categorized as enterprise server or
performance servers. These platforms needs to go through some
certifications tests, which tests control via _PPC.
The relative risk of enabling by default is low as this is is less likely
that these systems have broken _PSS table.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When policy->max is changed via _PPC or sysfs and is more than the max non
turbo frequency, it does not really change resulting performance in some
processors. When policy->max results in a P-State ratio more than the
turbo activation ratio, then processor can choose any P-State up to max
turbo. So the user or _PPC setting has no value, but this can cause
undesirable side effects like:
- Showing reduced max percentage in Intel P-State sysfs
- It can cause reduced max performance under certain boundary conditions:
The requested max scaling frequency either via _PPC or via cpufreq-sysfs,
will be converted into a fixed floating point max percent scale. In
majority of the cases this will result in correct max. But not 100% of the
time. If the _PPC is requested at a point where the calculation lead to a
lower max, this can result in a lower P-State then expected and it will
impact performance.
Example of this condition using a Broadwell laptop with config TDP.
ACPI _PSS table from a Broadwell laptop
2301000 2300000 2200000 2000000 1900000 1800000 1700000 1500000 1400000
1300000 1100000 1000000 900000 800000 600000 500000
The actual results by disabling config TDP so that we can get what is
requested on or below 2300000Khz.
scaling_max_freq Max Requested P-State Resultant scaling
max
---------------------------------------- ----------------------
2400000 18 2900000 (max
turbo)
2300000 17 2300000 (max
physical non turbo)
2200000 15 2100000
2100000 15 2100000
2000000 13 1900000
1900000 13 1900000
1800000 12 1800000
1700000 11 1700000
1600000 10 1600000
1500000 f 1500000
1400000 e 1400000
1300000 d 1300000
1200000 c 1200000
1100000 a 1000000
1000000 a 1000000
900000 9 900000
800000 8 800000
700000 7 700000
600000 6 600000
500000 5 500000
------------------------------------------------------------------
Now set the config TDP level 1 ratio as 0x0b (equivalent to 1100000KHz)
in BIOS (not every system will let you adjust this).
The turbo activation ratio will be set to one less than that, which will
be 0x0a (So any request above 1000000KHz should result in turbo region
assuming no thermal limits).
Here _PPC will request max to 1100000KHz (which basically should still
result in turbo as this is more than the turbo activation ratio up to
max allowable turbo frequency), but actual calculation resulted in a max
ceiling P-State which is 0x0a. So under any load condition, this driver
will not request turbo P-States. This will be a huge performance hit.
When config TDP feature is ON, if the _PPC points to a frequency above
turbo activation ratio, the performance can still reach max turbo. In this
case we don't need to treat this as the reduced frequency in set_policy
callback.
In this change when config TDP is active (by checking if the physical max
non turbo ratio is more than the current max non turbo ratio), any request
above current max non turbo is treated as full performance.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use ACPI _PPC notification to limit max P state driver will request.
ACPI _PPC change notification is sent by BIOS to limit max P state
in several cases:
- Reduce impact of platform thermal condition
- When Config TDP feature is used, a changed _PPC is sent to
follow TDP change
- Remote node managers in server want to control platform power
via baseboard management controller (BMC)
This change registers with ACPI processor performance lib so that
_PPC changes are notified to cpufreq core, which in turns will
result in call to .setpolicy() callback. Also the way _PSS
table identifies a turbo frequency is not compatible to max turbo
frequency in intel_pstate, so the very first entry in _PSS needs
to be adjusted.
This feature can be turned on by using kernel parameters:
intel_pstate=support_acpi_ppc
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The frequency transition latency from pmin to pmax is observed to be in
few millisecond granurality. And it usually happens to take a performance
penalty during sudden frequency rampup requests.
This patch set solves this problem by using an entity called "global
pstates". The global pstate is a Chip-level entity, so the global entitiy
(Voltage) is managed across the cores. The local pstate is a Core-level
entity, so the local entity (frequency) is managed across threads.
This patch brings down global pstate at a slower rate than the local
pstate. Hence by holding global pstates higher than local pstate makes
the subsequent rampups faster.
A per policy structure is maintained to keep track of the global and
local pstate changes. The global pstate is brought down using a parabolic
equation. The ramp down time to pmin is set to ~5 seconds. To make sure
that the global pstates are dropped at regular interval , a timer is
queued for every 2 seconds during ramp-down phase, which eventually brings
the pstate down to local pstate.
Iozone results show fairly consistent performance boost.
YCSB on redis shows improved Max latencies in most cases.
Iozone write/rewite test were made with filesizes 200704Kb and 401408Kb
with different record sizes . The following table shows IOoperations/sec
with and without patch.
Iozone Results ( in op/sec) ( mean over 3 iterations )
---------------------------------------------------------------------
file size- with without %
recordsize-IOtype patch patch change
----------------------------------------------------------------------
200704-1-SeqWrite 1616532 1615425 0.06
200704-1-Rewrite 2423195 2303130 5.21
200704-2-SeqWrite 1628577 1602620 1.61
200704-2-Rewrite 2428264 2312154 5.02
200704-4-SeqWrite 1617605 1617182 0.02
200704-4-Rewrite 2430524 2351238 3.37
200704-8-SeqWrite 1629478 1600436 1.81
200704-8-Rewrite 2415308 2298136 5.09
200704-16-SeqWrite 1619632 1618250 0.08
200704-16-Rewrite 2396650 2352591 1.87
200704-32-SeqWrite 1632544 1598083 2.15
200704-32-Rewrite 2425119 2329743 4.09
200704-64-SeqWrite 1617812 1617235 0.03
200704-64-Rewrite 2402021 2321080 3.48
200704-128-SeqWrite 1631998 1600256 1.98
200704-128-Rewrite 2422389 2304954 5.09
200704-256 SeqWrite 1617065 1616962 0.00
200704-256-Rewrite 2432539 2301980 5.67
200704-512-SeqWrite 1632599 1598656 2.12
200704-512-Rewrite 2429270 2323676 4.54
200704-1024-SeqWrite 1618758 1616156 0.16
200704-1024-Rewrite 2431631 2315889 4.99
401408-1-SeqWrite 1631479 1608132 1.45
401408-1-Rewrite 2501550 2459409 1.71
401408-2-SeqWrite 1617095 1626069 -0.55
401408-2-Rewrite 2507557 2443621 2.61
401408-4-SeqWrite 1629601 1611869 1.10
401408-4-Rewrite 2505909 2462098 1.77
401408-8-SeqWrite 1617110 1626968 -0.60
401408-8-Rewrite 2512244 2456827 2.25
401408-16-SeqWrite 1632609 1609603 1.42
401408-16-Rewrite 2500792 2451405 2.01
401408-32-SeqWrite 1619294 1628167 -0.54
401408-32-Rewrite 2510115 2451292 2.39
401408-64-SeqWrite 1632709 1603746 1.80
401408-64-Rewrite 2506692 2433186 3.02
401408-128-SeqWrite 1619284 1627461 -0.50
401408-128-Rewrite 2518698 2453361 2.66
401408-256-SeqWrite 1634022 1610681 1.44
401408-256-Rewrite 2509987 2446328 2.60
401408-512-SeqWrite 1617524 1628016 -0.64
401408-512-Rewrite 2504409 2442899 2.51
401408-1024-SeqWrite 1629812 1611566 1.13
401408-1024-Rewrite 2507620 2442968 2.64
Tested with YCSB workload (50% update + 50% read) over redis for 1 million
records and 1 million operation. Each test was carried out with target
operations per second and persistence disabled.
Max-latency (in us)( mean over 5 iterations )
---------------------------------------------------------------
op/s Operation with patch without patch %change
---------------------------------------------------------------
15000 Read 61480.6 50261.4 22.32
15000 cleanup 215.2 293.6 -26.70
15000 update 25666.2 25163.8 2.00
25000 Read 32626.2 89525.4 -63.56
25000 cleanup 292.2 263.0 11.10
25000 update 32293.4 90255.0 -64.22
35000 Read 34783.0 33119.0 5.02
35000 cleanup 321.2 395.8 -18.8
35000 update 36047.0 38747.8 -6.97
40000 Read 38562.2 42357.4 -8.96
40000 cleanup 371.8 384.6 -3.33
40000 update 27861.4 41547.8 -32.94
45000 Read 42271.0 88120.6 -52.03
45000 cleanup 263.6 383.0 -31.17
45000 update 29755.8 81359.0 -63.43
(test without target op/s)
47659 Read 83061.4 136440.6 -39.12
47659 cleanup 195.8 193.8 1.03
47659 update 73429.4 124971.8 -41.24
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
commit 1b0289848d ("cpufreq: powernv: Add sysfs attributes to show
throttle stats") used policy->driver_data as a flag for one-time creation
of throttle sysfs files. Instead of this use 'kernfs_find_and_get()' to
check if the attribute already exists. This is required as
policy->driver_data is used for other purposes in the later patch.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When the config TDP level is not nominal (level = 0), the MSR values for
reading level 1 and level 2 ratios contain power in low 14 bits and actual
ratio bits are at bits [23:16]. The current processing for level 1 and
level 2 is wrong as there is no shift done to get actual ratio.
Fixes: 6a35fc2d6c (cpufreq: intel_pstate: get P1 from TAR when available)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The way cpufreq_governor_start() initializes j_cdbs->prev_load is
questionable.
First off, j_cdbs->prev_cpu_wall used as a denominator in the
computation may be zero. The case this happens is when
get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy()
used to return that number is called exactly at the jiffies_64
wrap time. It is rather hard to trigger that error, but it is not
impossible and it will just crash the kernel then.
Second, j_cdbs->prev_load is computed as the average load during
the entire time since the system started and it may not reflect the
load in the previous sampling period (as it is expected to).
That doesn't play well with the way dbs_update() uses that value.
Namely, if the update time delta (wall_time) happens do be greater
than twice the sampling rate on the first invocation of it, the
initial value of j_cdbs->prev_load (which may be completely off) will
be returned to the caller as the current load (unless it is equal to
zero and unless another CPU sharing the same policy object has a
greater load value).
For this reason, notice that the prev_load field of struct cpu_dbs_info
is only used by dbs_update() and only in that one place, so if
cpufreq_governor_start() is modified to always initialize it to 0,
it will make dbs_update() always compute the actual load first time
it checks the update time delta against the doubled sampling rate
(after initialization) and there won't be any side effects of it.
Consequently, modify cpufreq_governor_start() as described.
Fixes: 18b46abd00 (cpufreq: governor: Be friendly towards latency-sensitive bursty workloads)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch add rockchip's compatible string to the compat list and
remove similar code from platform code for supporting generic platdev
driver.
Signed-off-by: Finley Xiao <finley.xiao@rock-chips.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Note that the complete routine imx27_dt_init() is removed as
of_platform_populate(NULL, of_default_bus_match_table, NULL, NULL);
has same effect as a NULL .init_machine machine callback pointer.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The machines array in cpufreq-dt-platdev is used only once at boot time
and so should be marked with __initconst, so that kernel can free up
memory used for it, if required.
Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cooling device is registered by ready callback. It's also invoked while
system resuming from sleep (Enabling non-boot cpus). Thus cooling device
may be multiple registered. Matchable unregistration is added to exit
callback to fix this issue.
Signed-off-by: Jia Hongtao <hongtao.jia@nxp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
.exit callback (qoriq_cpufreq_cpu_exit()) is also used during suspend.
So __exit macro should be removed or the function will be discarded.
Signed-off-by: Jia Hongtao <hongtao.jia@nxp.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When THERMAL_OF is undefined the cooling device messages should not be
shown. -ENOSYS is returned from of_cpufreq_cooling_register() when
THERMAL_OF is undefined.
Signed-off-by: Jia Hongtao <hongtao.jia@nxp.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a function to cleanup at module exit and export
appropriate GPL string to enable moduler support
for the cppc_cpufreq driver.
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@intel.com>
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The result returned by pid_calc() is subtracted from current_pstate
(which is the P-State requested during the last period) in order to
obtain the target P-State for the current iteration.
However, current_pstate may not reflect the real current P-State of
the CPU. In particular, that P-State may be higher because of the
frequency sharing per module.
The theory is:
- The load is the percentage of time spent in C0 and is related to
the average P-State during the same period.
- The last requested P-State can be completely different than the
average P-State (because of frequency sharing or throttling).
- The P-State shift computed by the pid_calc is based on the load
computed at average P-State, so the shift must be relative to
this average P-State.
Using the average P-State instead of current P-State improves power
without significant performance penalty in cases when a task migrates
from one core to other core sharing frequency and voltage.
Performance and power comparison with this patch on Cherry Trail
platform using Android:
Benchmark ?Perf ?Power
FishTank 10.45% 3.1%
SmartBench-Gaming -0.1% -10.4%
SmartBench-Productivity -0.8% -10.4%
CandyCrush n/a -17.4%
AngryBirds n/a -5.9%
videoPlayback n/a -13.9%
audioPlayback n/a -4.9%
IcyRocks-20-50 0.0% -38.4%
iozone RR -0.16% -1.3%
iozone RW 0.74% -1.3%
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit 0df35026c6 (cpufreq: governor: Fix negative idle_time
when configured with CONFIG_HZ_PERIODIC) that introduced a regression
by causing the ondemand cpufreq governor to misbehave for
CONFIG_TICK_CPU_ACCOUNTING unset (the frequency goes up to the max at
one point and stays there indefinitely).
The revert takes subsequent modifications of the code in question into
account.
Fixes: 0df35026c6 (cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=115261
Reported-and-tested-by: Timo Valtoaho <timo.valtoaho@gmail.com>
Cc: 4.5+ <stable@vger.kernel.org> # 4.5+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
* pm-cpufreq-fixes:
cpufreq: Abort cpufreq_update_current_freq() for cpufreq_suspended set
intel_pstate: Avoid getting stuck in high P-states when idle
Since governor operations are generally skipped if cpufreq_suspended
is set, cpufreq_start_governor() should do nothing in that case.
That function is called in the cpufreq_online() path, and may also
be called from cpufreq_offline() in some cases, which are invoked
by the nonboot CPUs disabing/enabling code during system suspend
to RAM and resume. That happens when all devices have been
suspended, so if the cpufreq driver relies on things like I2C to
get the current frequency, it may not be ready to do that then.
To prevent problems from happening for this reason, make
cpufreq_update_current_freq(), which is the only function invoked
by cpufreq_start_governor() that doesn't check cpufreq_suspended
already, return 0 upfront if cpufreq_suspended is set.
Fixes: 3bbf8fe3ae (cpufreq: Always update current frequency before startig governor)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The Kconfig for this driver is currently:
config CPU_FREQ_CBE_PMI
bool "CBE frequency scaling using PMI interface"
...meaning that it currently is not being built as a module by
anyone. Lets remove the modular and unused code here, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.
We also delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Christian Krafft <krafft@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-pm@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jörg Otte reports that commit a4675fbc4a (cpufreq: intel_pstate:
Replace timers with utilization update callbacks) caused the CPUs in
his Haswell-based system to stay in the very high frequency region
even if the system is completely idle.
That turns out to be an existing problem in the intel_pstate driver's
P-state selection algorithm for Core processors. Namely, all
decisions made by that algorithm are based on the average frequency
of the CPU between sampling events and on the P-state requested on
the last invocation, so it may get stuck at a very hight frequency
even if the utilization of the CPU is very low (in fact, it may get
stuck in a inadequate P-state regardless of the CPU utilization).
The only way to kick it out of that limbo is a sufficiently long idle
period (3 times longer than the prescribed sampling interval), but if
that doesn't happen often enough (eg. due to a timing change like
after the above commit), the P-state of the CPU may be inadequate
pretty much all the time.
To address the most egregious manifestations of that issue, reset the
core_busy value used to determine the next P-state to request if the
utilization of the CPU, determined with the help of the MPERF
feedback register and the TSC, is below 1%.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=115771
Reported-and-tested-by: Jörg Otte <jrg.otte@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The freq-table is stored in struct cpufreq_policy also and there is
absolutely no need of keeping a copy of its reference in struct
acpi_cpufreq_data. Drop it.
Also policy->freq_table can't be NULL in the target() callback, remove
the useless check as well.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Its always set by ->init() and so it will always be there in ->exit().
There is no need to have a special check for just that.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reorganize the code in cpufreq_add_dev() to avoid using the ret
variable and reduce the indentation level in it.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>