linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-23 16:09:57 +07:00

Author	SHA1	Message	Date
Rafael J. Wysocki	4ddd0146c7	cpufreq: intel_pstate: Fold intel_pstate_reset_all_pid() into the caller There is only one caller of intel_pstate_reset_all_pid(), which is pid_param_set() used in the debugfs interface only, and having that code split does not make it particularly convenient to follow. For this reason, move the body of intel_pstate_reset_all_pid() into its caller and drop that function. Also change the loop from for_each_online_cpu() (which is obviously racy with respect to CPU offline/online) to for_each_possible_cpu(), so that all PID parameters are reset for all CPUs regardless of their online/offline status (to prevent, for example, a previously offline CPU from going online with a stale set of PID parameters). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-28 23:12:09 +02:00
Rafael J. Wysocki	5c43905369	cpufreq: intel_pstate: Initialize pid_params statically Notice that both the existing struct cpu_defaults instances in which PID parameters are actually initialized use the same values of those parameters, so it is not really necessary to copy them over to pid_params dynamically. Instead, initialize pid_params statically with those values and drop the unused pid_policy member from struct cpu_defaults along with copy_pid_params() used for initializing it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-28 23:12:08 +02:00
Rafael J. Wysocki	6404367862	cpufreq: intel_pstate: Drop pointless initialization of PID parameters The P-state selection algorithm used by intel_pstate for Atom processors is not based on the PID controller and the initialization of PID parametrs for those processors is pointless and confusing, so drop it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-28 23:12:07 +02:00
Rafael J. Wysocki	e14cf8857e	cpufreq: intel_pstate: Eliminate struct perf_limits After recent changes the purpose of struct perf_limits is not particularly clear any more and the code may be made somewhat easier to follow by eliminating it, so go for that. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-28 23:12:07 +02:00
Rafael J. Wysocki	80b120ca1a	cpufreq: intel_pstate: Avoid transient updates of cpuinfo.max_freq Both intel_pstate_verify_policy() and intel_cpufreq_verify_policy() set policy->cpuinfo.max_freq depending on the turbo status, but the updates made by them are discarded by the core, because the policy object passed to them by the core is temporary and cpuinfo.max_freq from that object is not copied to the final policy object in cpufreq_set_policy(). However, cpufreq_set_policy() passes the temporary policy object to the ->setpolicy callback of the driver, so intel_pstate_set_policy() actually sees the policy->cpuinfo.max_freq value updated by intel_pstate_verify_policy() and not the final one. It also updates policy->max sometimes which basically has no effect after it returns, because the core discards that update. To avoid confusion, eliminate policy->cpuinfo.max_freq updates from intel_pstate_verify_policy() and intel_cpufreq_verify_policy() entirely and check the maximum frequency explicitly in intel_pstate_update_perf_limits() instead of relying on the transiently updated policy->cpuinfo.max_freq value. Moreover, move the max->policy adjustment carried out in intel_pstate_set_policy() to a separate function and call that function from the ->verify driver callbacks to ensure that it will actually be effective. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-24 03:04:32 +01:00
Rafael J. Wysocki	c5a2ee7dde	cpufreq: intel_pstate: Active mode P-state limits rework The coordination of P-state limits used by intel_pstate in the active mode (ie. by default) is problematic, because it synchronizes all of the limits (ie. the global ones and the per-policy ones) so as to use one common pair of P-state limits (min and max) across all CPUs in the system. The drawbacks of that are as follows: - If P-states are coordinated in hardware, it is not necessary to coordinate them in software on top of that, so in that case all of the above activity is in vain. - If P-states are not coordinated in hardware, then the processor is actually capable of setting different P-states for different CPUs and coordinating them at the software level simply doesn't allow that capability to be utilized. - The coordination works in such a way that setting a per-policy limit (eg. scaling_max_freq) for one CPU causes the common effective limit to change (and it will affect all of the other CPUs too), but subsequent reads from the corresponding sysfs attributes for the other CPUs will return stale values (which is confusing). - Reads from the global P-state limit attributes, min_perf_pct and max_perf_pct, return the effective common values and not the last values set through these attributes. However, the last values set through these attributes become hard limits that cannot be exceeded by writes to scaling_min_freq and scaling_max_freq, respectively, and they are not exposed, so essentially users have to remember what they are. All of that is painful enough to warrant a change of the management of P-state limits in the active mode. To that end, redesign the active mode P-state limits management in intel_pstate in accordance with the following rules: (1) All CPUs are affected by the global limits (that is, none of them can be requested to run faster than the global max and none of them can be requested to run slower than the global min). (2) Each individual CPU is affected by its own per-policy limits (that is, it cannot be requested to run faster than its own per-policy max and it cannot be requested to run slower than its own per-policy min). (3) The global and per-policy limits can be set independently. Also, the global maximum and minimum P-state limits will be always expressed as percentages of the maximum supported turbo P-state. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-24 03:04:31 +01:00
Rafael J. Wysocki	553953453b	cpufreq: intel_pstate: Use load-based P-state selection more widely Extend the set of systems for which intel_pstate will use the "powersave" P-state selection algorithm based on CPU load in the active mode by systems with ACPI preferred profile set to "tablet", "appliance PC", "desktop", or "workstation" (ie. everything with a specified preferred profile that is not a "server"). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-24 03:04:31 +01:00
Rafael J. Wysocki	eb5139d1a2	cpufreq: intel_pstate: Support HWP processors in all operation modes Currently, some processors supporting HWP are only supported by intel_pstate if HWP is actually going to be used and not supported otherwise which is confusing. Specifically, they are not supported if "intel_pstate=no_hwp" is passed to the kernel in the command line or if the driver is started in the passive mode ("intel_pstate=passive"). There is no real reason for that, because everything about those processor is known anyway and the driver can work with them in all modes, so make that happen, but use the load-based P-state selection algorithm for the active mode "powersave" policy with them. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-24 03:04:31 +01:00
Rafael J. Wysocki	f1a91645b7	Merge back intel_pstate updates for 4.12.	2017-03-24 03:04:10 +01:00
Rafael J. Wysocki	64897b20ee	cpufreq: intel_pstate: Fix policy data management in passive mode The policy->cpuinfo.max_freq and policy->max updates in intel_cpufreq_turbo_update() are excessive as they are done for no good reason and may lead to problems in principle, so they should be dropped. However, after dropping them intel_cpufreq_turbo_update() becomes almost entirely pointless, because the check made by it is made again down the road in intel_pstate_prepare_request(). The only thing in it that still needs to be done is the call to update_turbo_state(), so drop intel_cpufreq_turbo_update() altogether and make its callers invoke update_turbo_state() directly instead of it. In addition to that, fix intel_cpufreq_verify_policy() so that it checks global.no_turbo in addition to global.turbo_disabled when updating policy->cpuinfo.max_freq to make it consistent with intel_pstate_verify_policy(). Fixes: `001c76f05b` (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-21 22:19:07 +01:00
Rafael J. Wysocki	7de32556df	cpufreq: intel_pstate: One set of global limits in active mode In the active mode intel_pstate currently uses two sets of global limits, each associated with one of the possible scaling_governor settings in that mode: "powersave" or "performance". The driver switches over from one of those sets to the other depending on the scaling_governor setting for the last CPU whose per-policy cpufreq interface in sysfs was last used to change parameters exposed in there. That obviously leads to no end of issues when the scaling_governor settings differ between CPUs. The most recent issue was introduced by commit `a240c4aa5d` (cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy) that eliminated the reinitialization of "performance" limits in intel_pstate_set_policy() preventing the max limit from being set to anything below 100, among other things. Namely, an undesirable side effect of commit `a240c4aa5d` is that now, after setting scaling_governor to "performance" in the active mode, the per-policy limits for the CPU in question go to the highest level and stay there even when it is switched back to "powersave" later. As it turns out, some distributions set scaling_governor to "performance" temporarily for all CPUs to speed-up system initialization, so that change causes them to misbehave later. To fix that, get rid of the performance/powersave global limits split and use just one set of global limits for everything. From the user's persepctive, after this modification, when scaling_governor is switched from "performance" to "powersave" or the other way around on one CPU, the limits settings (ie. the global max/min_perf_pct and per-policy scaling_max/min_freq for any CPUs) will not change. Still, switching from "performance" to "powersave" or the other way around changes the way in which P-states are selected and in particular "performance" causes the driver to always request the highest P-state it is allowed to ask for for the given CPU. Fixes: `a240c4aa5d` (cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-18 00:57:39 +01:00
Rafael J. Wysocki	e4c204ced0	cpufreq: intel_pstate: Avoid percentages in limits-related computations Currently, intel_pstate_update_perf_limits() first converts the policy minimum and maximum limits into percentages of the maximum turbo frequency (rounding up to an integer) and then converts these percentages to fractions (by using fixed-point arithmetic to divide them by 100). That introduces a rounding error unnecessarily, because the fractions can be obtained by carrying out fixed-point divisions directly on the input numbers. Rework the computations in intel_pstate_hwp_set() to use fractions instead of percentages (and drop redundant local variables from there) and modify intel_pstate_update_perf_limits() to compute the fractions directly and percentages out of them. While at it, introduce percent_ext_fp() for converting percentages to fractions (with extended number of fraction bits) and use it in the computations. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-15 16:52:29 +01:00
Srinivas Pandruvada	3f8ed54aee	cpufreq: intel_pstate: Correct frequency setting in the HWP mode In the functions intel_pstate_hwp_set(), min/max range from HWP capability MSR along with max_perf_pct and min_perf_pct, is used to set the HWP request MSR. In some cases this doesn't result in the correct HWP max/min in HWP request. For example: In the following case: HWP capabilities from MSR 0x771 0x70a1220 Here cpufreq min/max frequencies from above MSR dump are 700MHz and 3.2GHz respectively. This will result in hwp_min = 0x07 hwp_max = 0x20 To limit max frequency to 2GHz: perf_limits->max_perf_pct = 63 (2GHz as a percent of 3.2GHz rounded up) With the current calculation: adj_range = max_perf_pct * range / 100; adj_range = 63 * (32 - 7) / 100 adj_range = 15 max = hw_min + adj_range; max = 7 + 15 = 22 This will result in HWP request of 0x160f, which will result in a frequency cap of 2.2GHz not 2GHz. The problem with the above calculation is that hwp_min of 7 is treated as 0% in the range. But max_perf_pct is calculated with respect to minimum as 0 and max as 3.2GHz or hwp_max, so adding hwp_min to it will result in more than the desired. Since the min_perf_pct and max_perf_pct is already a percent of max frequency or hwp_max, this min/max HWP request value can be calculated directly applying these percentage to hwp_max. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-14 03:56:39 +01:00
Rafael J. Wysocki	6e7408acd0	cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set() Fix the debugfs interface for PID tuning to actually update pid_params.sample_rate_ns on PID parameters updates, as changing pid_params.sample_rate_ms via debugfs has no effect now. Fixes: `a4675fbc4a` (cpufreq: intel_pstate: Replace timers with utilization update callbacks) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2017-03-13 23:55:12 +01:00
Rafael J. Wysocki	5f98ced1c9	cpufreq: intel_pstate: Drop redundant wrapper function intel_pstate_hwp_set_policy() is a wrapper around intel_pstate_hwp_set(), but the only value it adds is to check hwp_active before calling the latter and one of its two callers has already checked hwp_active before that happens, so in that code path the additional check is redundant and using the wrapper is rather pointless. For this reason, drop intel_pstate_hwp_set_policy() and make its callers invoke intel_pstate_hwp_set() directly (after checking hwp_active). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2017-03-12 23:07:58 +01:00
Rafael J. Wysocki	fd8e57d5d3	Merge branch 'pm-cpufreq' * pm-cpufreq: cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy cpufreq: intel_pstate: Fix intel_pstate_verify_policy() cpufreq: intel_pstate: Fix global settings in active mode cpufreq: Add the "cpufreq.off=1" cmdline option cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy() cpufreq: intel_pstate: Do not use performance_limits in passive mode	2017-03-09 15:12:27 +01:00
Rafael J. Wysocki	a240c4aa5d	cpufreq: intel_pstate: Do not reinit performance limits in ->setpolicy If the current P-state selection algorithm is set to "performance" in intel_pstate_set_policy(), the limits may be initialized from scratch, but only if no_turbo is not set and the maximum frequency allowed for the given CPU (i.e. the policy object representing it) is at least equal to the max frequency supported by the CPU. In all of the other cases, the limits will not be updated. For example, the following can happen: # cat intel_pstate/status active # echo performance > cpufreq/policy0/scaling_governor # cat intel_pstate/min_perf_pct 100 # echo 94 > intel_pstate/min_perf_pct # cat intel_pstate/min_perf_pct 100 # cat cpufreq/policy0/scaling_max_freq 3100000 echo 3000000 > cpufreq/policy0/scaling_max_freq # cat intel_pstate/min_perf_pct 94 # echo 95 > intel_pstate/min_perf_pct # cat intel_pstate/min_perf_pct 95 That is confusing for two reasons. First, the initial attempt to change min_perf_pct to 94 seems to have no effect, even though setting the global limits should always work. Second, after changing scaling_max_freq for policy0 the global min_perf_pct attribute shows 94, even though it should have not been affected by that operation in principle. Moreover, the final attempt to change min_perf_pct to 95 worked as expected, because scaling_max_freq for the only policy with scaling_governor equal to "performance" was different from the maximum at that time. To make all that confusion go away, modify intel_pstate_set_policy() so that it doesn't reinitialize the limits at all. At the same time, change intel_pstate_set_performance_limits() to set min_sysfs_pct to 100 in the "performance" limits set so that switching the P-state selection algorithm to "performance" causes intel_pstate/min_perf_pct in sysfs to go to 100 (or whatever value min_sysfs_pct in the "performance" limits is set to later). That requires per-CPU limits to be initialized explicitly rather than by copying the global limits to avoid setting min_sysfs_pct in the per-CPU limits to 100. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-06 00:06:05 +01:00
Rafael J. Wysocki	d74b199291	cpufreq: intel_pstate: Fix intel_pstate_verify_policy() The code added to intel_pstate_verify_policy() by commit `1443ebbacf` (cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy) should use perf_limits instead of limits, because otherwise setting global limits via sysfs may affect policies inconsistently. For example, in the sequence of shell commands below, the scaling_min_freq attribute for policy1 and policy2 should be affected in the same way, because scaling_governor is set in the same way for both of them: # cat cpufreq/policy1/scaling_governor powersave # cat cpufreq/policy2/scaling_governor powersave # echo performance > cpufreq/policy0/scaling_governor # echo 94 > intel_pstate/min_perf_pct # cat cpufreq/policy0/scaling_min_freq 2914000 # cat cpufreq/policy1/scaling_min_freq 2914000 # cat cpufreq/policy2/scaling_min_freq 800000 The are affected differently, because intel_pstate_verify_policy() is invoked with limits set to &performance_limits (left behind by policy0) for policy1 and with limits set to &powersave_limits (left behind by policy1) for policy2. Since perf_limits is set to the set of limits matching the policy being updated, using it instead of limits fixes the inconsistency. Fixes: `1443ebbacf` (cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-06 00:06:05 +01:00
Rafael J. Wysocki	cd59b4bed9	cpufreq: intel_pstate: Fix global settings in active mode Commit `111b8b3fe4` (cpufreq: intel_pstate: Always keep all limits settings in sync) changed intel_pstate to invoke cpufreq_update_policy() for every registered CPU on global sysfs attributes updates, but that led to undesirable effects in the active mode if the "performance" P-state selection algorithm is configufred for one CPU and the "powersave" one is chosen for all of the other CPUs. Namely, in that case, the following is possible: # cd /sys/devices/system/cpu/ # cat intel_pstate/max_perf_pct 100 # cat intel_pstate/min_perf_pct 26 # echo performance > cpufreq/policy0/scaling_governor # cat intel_pstate/max_perf_pct 100 # cat intel_pstate/min_perf_pct 100 # echo 94 > intel_pstate/min_perf_pct # cat intel_pstate/min_perf_pct 26 The reason why this happens is because intel_pstate attempts to maintain two sets of global limits in the active mode, one for the "performance" P-state selection algorithm and one for the "powersave" P-state selection algorithm, but the P-state selection algorithms are set per policy, so the global limits cannot reflect all of them at the same time if they are different for different policies. In the particular situation above, the attempt to change min_perf_pct to 94 caused cpufreq_update_policy() to be run for a CPU with the "powersave" P-state selection algorithm and intel_pstate_set_policy() called by it silently switched the global limits to the "powersave" set which finally was reflected by the sysfs interface. To prevent that from happening, modify intel_pstate_update_policies() to always switch back to the set of limits that was used right before it has been invoked. Fixes: `111b8b3fe4` (cpufreq: intel_pstate: Always keep all limits settings in sync) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-06 00:06:04 +01:00
Rafael J. Wysocki	6407829901	cpufreq: intel_pstate: Avoid triggering cpu_frequency tracepoint unnecessarily In the passive mode the cpu_frequency trace event is already triggered by the cpufreq core or by scaling governors, so intel_pstate should not trigger it once again for the same P-state updates. In addition to that, the frequency returned by intel_cpufreq_fast_switch() and passed via freqs.new from intel_cpufreq_target() to cpufreq_freq_transition_end() should reflect the P-state actually set, so make that happen. Fixes: `001c76f05b` (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-04 01:38:42 +01:00
Rafael J. Wysocki	7f17326fc0	cpufreq: intel_pstate: Fix intel_cpufreq_verify_policy() The intel_pstate_update_perf_limits() called from intel_cpufreq_verify_policy() may cause global P-state limits to change which is generally confusing and unnecessary. In the passive mode the global limits are only applied to the frequency selected by the scaling governor (they are not taken into account by governors when making decisions anyway), so making them follow the per-policy limits serves no purpose and may go against user expectations (as it generally causes the global attributes in sysfs to change even though they have not been written to in some cases). Fix that by dropping the intel_pstate_update_perf_limits() invocation from intel_cpufreq_verify_policy() (which also reduces the code size by a few lines). This change does not affect the per-CPU limits case, because those limits allow any P-state to be set by default in the passive mode and it removes the only piece of code updating them in that mode, so the per-policy settings will be the only ones taken into account in that case as expected. Fixes: `001c76f05b` (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-04 01:38:41 +01:00
Rafael J. Wysocki	2bc756e7dd	cpufreq: intel_pstate: Do not use performance_limits in passive mode Using performance_limits in the passive mode doesn't make sense, because in that mode the global limits are applied to the frequency selected by the scaling governor. The maximum and minimum P-state limits in performance_limits are both set to 100 percent which will put all CPUs into the turbo range regardless of what governor is used and what frequencies are selected by it (that is particularly undesirable on CPUs with the generic powersave governor attached). For this reason, make intel_pstate_register_driver() always point limits to powersave_limits in the passive mode. Fixes: `001c76f05b` (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-03-04 01:38:41 +01:00
Linus Torvalds	1827adb11a	Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...	2017-03-03 10:16:38 -08:00
Linus Torvalds	c82be9d224	Power management turbostat utility updates for v4.11-rc1 These update turbostat significantly and in particular: - Default output is now verbose, --debug is no longer required to get all counters. As a result, some options have been added to specify exactly what output is wanted. - Added --quiet to skip system configuration output - Added --list, --show and --hide parameters - Added --cpu parameter - Enhanced Baytrail SoC support - Added Gemini Lake SoC support - Added sysfs C-state columns Also the symbol definitions in arch/x86/include/asm/intel-family.h and arch/x86/include/asm/msr-index.h are updated and the intel_idle and intel_pstate drivers are modified to use the updated symbols. Credits to Len Brown for all of these changes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJYuLyNAAoJEILEb/54YlRxEvkQAJsggzpgGrlhrO6KHSm4yC9M CqhBVsdeppX1ZTAVPiMk/pcXQYtL5fZ97ELk2So/CjT5Nh3jwDPMA/ux5n3uiob+ O2BTdtxnpNLxPQPQM1mW7Dr/uAIRlJug9gSMxKDbFSU9Oe3aET58PUdUTs7xaT59 nbtLxVSvzrdGk/bX6WO4ic+7F2licJLZPfDGhYidnoika8LxD4M+cIO73gFpgqQi yoKrTZyLimvneFT0eAUUvHIyKjkJIxeMfslW57uBpz8rW5my+3UwsdpRG4AIVeWc wSBlsNqj+TuR4BBiZ2VR2RoHF3qbH/SceI+k864BqyThfyK/g2q/vV/GvLZQCR/R yWcajWD9kvLKvnm1D3XYOIQDBeP4l60j3vVwHytSvmaPYjn5Ms3jq6b+2K6zkXMM 8y3leW/hgw+rGCacdXPrKIlpBykSV7h+TnD2iMxeeDISNkbefWWDe/WB6HncocAg HDtKRvU9ntRq6/MlnTKbCFM5c0oCXWRw4QNjDy3AsjJELgeAIwiqpHWMKO6XltFj qU/rdyW/BTCuAlIjWVbjooAIJZ268geupeug3zvE3uGzrxT4DaVIo8W1wtJ+XQrt By7sOW/gMQ2EcTJQiuFjS/Gz5gOKQ2F8OLCm6T8Prjh6SxrCUAiuIvP0LmxUCa8i KMlx+8c9E2f9j+TTt9AP =oMZe -----END PGP SIGNATURE----- Merge tag 'pm-turbostat-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull turbostat utility updates from Rafael Wysocki: "Power management turbostat utility updates. These update turbostat significantly and in particular: - default output is now verbose, --debug is no longer required to get all counters. As a result, some options have been added to specify exactly what output is wanted. - added --quiet to skip system configuration output - added --list, --show and --hide parameters - added --cpu parameter - enhanced Baytrail SoC support - added Gemini Lake SoC support - added sysfs C-state columns Also the symbol definitions in arch/x86/include/asm/intel-family.h and arch/x86/include/asm/msr-index.h are updated and the intel_idle and intel_pstate drivers are modified to use the updated symbols. Credits to Len Brown for all of these changes" * tag 'pm-turbostat-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (44 commits) tools/power turbostat: version 17.02.24 tools/power turbostat: bugfix: --add u32 was printed as u64 tools/power turbostat: show error on exec tools/power turbostat: dump p-state software config tools/power turbostat: show package number, even without --debug tools/power turbostat: support "--hide C1" etc. tools/power turbostat: move --Package and --processor into the --cpu option tools/power turbostat: turbostat.8 update tools/power turbostat: update --list feature tools/power turbostat: use wide columns to display large numbers tools/power turbostat: Add --list option to show available header names tools/power turbostat: fix zero IRQ count shown in one-shot command mode tools/power turbostat: add --cpu parameter tools/power turbostat: print sysfs C-state stats tools/power turbostat: extend --add option to accept /sys path tools/power turbostat: skip unused counters on BDX tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits tools/power turbostat: skip unused counters on SKX tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7 tools/power turbostat: initial Gemini Lake SOC support ...	2017-03-02 17:41:27 -08:00
Ingo Molnar	55687da166	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/cpufreq.h> We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/cpufreq.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:30 +01:00
Rafael J. Wysocki	6bff9c609f	Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull changes related to turbostat for v4.11 from Len Brown. * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (44 commits) tools/power turbostat: version 17.02.24 tools/power turbostat: bugfix: --add u32 was printed as u64 tools/power turbostat: show error on exec tools/power turbostat: dump p-state software config tools/power turbostat: show package number, even without --debug tools/power turbostat: support "--hide C1" etc. tools/power turbostat: move --Package and --processor into the --cpu option tools/power turbostat: turbostat.8 update tools/power turbostat: update --list feature tools/power turbostat: use wide columns to display large numbers tools/power turbostat: Add --list option to show available header names tools/power turbostat: fix zero IRQ count shown in one-shot command mode tools/power turbostat: add --cpu parameter tools/power turbostat: print sysfs C-state stats tools/power turbostat: extend --add option to accept /sys path tools/power turbostat: skip unused counters on BDX tools/power turbostat: fix decoding for GLM, DNV, SKX turbo-ratio limits tools/power turbostat: skip unused counters on SKX tools/power turbostat: Denverton: use HW CC1 counter, skip C3, C7 tools/power turbostat: initial Gemini Lake SOC support ...	2017-03-01 23:34:38 +01:00
Len Brown	92134bdbc6	intel_pstate: use MSR_ATOM_RATIOS definitions from msr-index.h Originally, these MSRs were locally defined in this driver. Now the definitions are in msr-index.h -- use them. Signed-off-by: Len Brown <len.brown@intel.com>	2017-03-01 00:14:03 -05:00
Rafael J. Wysocki	c3a49c8991	cpufreq: intel_pstate: Fix limits issue with operation mode switching There is a problem with intel_pstate operation mode switching introduced by commit `fb1fe1041c` (cpufreq: intel_pstate: Operation mode control from sysfs), because the global sysfs limits are preserved across operation modes while per-policy limits are reinitialized from scratch on a mode switch and both sets of limits may get out of sync this way. Fix that by always reinitializing the global limits upon the registration of the driver. Fixes: `fb1fe1041c` (cpufreq: intel_pstate: Operation mode control from sysfs) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2017-02-28 13:55:52 +01:00
Rafael J. Wysocki	56c7303e62	Merge back earlier cpufreq changes for v4.11.	2017-02-09 01:18:14 +01:00
Rafael J. Wysocki	cbf304e420	Merge branches 'pm-core-fixes' and 'pm-cpufreq-fixes' * pm-core-fixes: PM / runtime: Avoid false-positive warnings from might_sleep_if() * pm-cpufreq-fixes: cpufreq: intel_pstate: Disable energy efficiency optimization cpufreq: brcmstb-avs-cpufreq: properly retrieve P-state upon suspend cpufreq: brcmstb-avs-cpufreq: extend sysfs entry brcm_avs_pmap	2017-02-06 14:52:10 +01:00
Srinivas Pandruvada	6e978b22ef	cpufreq: intel_pstate: Disable energy efficiency optimization Some Kabylake desktop processors may not reach max turbo when running in HWP mode, even if running under sustained 100% utilization. This occurs when the HWP.EPP (Energy Performance Preference) is set to "balance_power" (0x80) -- the default on most systems. It occurs because the platform BIOS may erroneously enable an energy-efficiency setting -- MSR_IA32_POWER_CTL BIT-EE, which is not recommended to be enabled on this SKU. On the failing systems, this BIOS issue was not discovered when the desktop motherboard was tested with Windows, because the BIOS also neglects to provide the ACPI/CPPC table, that Windows requires to enable HWP, and so Windows runs in legacy P-state mode, where this setting has no effect. Linux' intel_pstate driver does not require ACPI/CPPC to enable HWP, and so it runs in HWP mode, exposing this incorrect BIOS configuration. There are several ways to address this problem. First, Linux can also run in legacy P-state mode on this system. As intel_pstate is how Linux enables HWP, booting with "intel_pstate=disable" will run in acpi-cpufreq/ondemand legacy p-state mode. Or second, the "performance" governor can be used with intel_pstate, which will modify HWP.EPP to 0. Or third, starting in 4.10, the /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference attribute in can be updated from "balance_power" to "performance". Or fourth, apply this patch, which fixes the erroneous setting of MSR_IA32_POWER_CTL BIT_EE on this model, allowing the default configuration to function as designed. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Cc: 4.6+ <stable@vger.kernel.org> # 4.6+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-02-04 00:11:08 +01:00
Srinivas Pandruvada	8fc7554ae5	cpufreq: intel_pstate: Calculate guaranteed performance for HWP When HWP is active, turbo activation ratio is not used to calculate max non turbo ratio. But on these systems the max non turbo ratio is decided by config TDP settings. This change removes usage of MSR_TURBO_ACTIVATION_RATIO for HWP systems, instead directly use TDP ratios, when more than one TDPs are available. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-02-04 00:05:33 +01:00
Srinivas Pandruvada	4e5d3f713b	cpufreq: intel_pstate: Make HWP limits compatible with legacy Under HWP the performance limits are calculated using max_perf_pct and min_perf_pct using possible performance, not available performance. The available performance can be reduced by no_turbo setting. To make compatible with legacy mode, use max/min performance percentage with respect to available performance. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-02-04 00:05:32 +01:00
Srinivas Pandruvada	7d9a8a9f4e	cpufreq: intel_pstate: Lower frequency than expected under no_turbo When turbo is not disabled by BIOS, but user disabled from intel P-State sysfs and changes max/min using cpufreq sysfs, the resultant frequency is lower than what user requested. The reason for this, when the perf limits are calculated in set_policy() callback, they are with reference to max cpu frequency (turbo frequency ), but when enforced in the intel_pstate_get_min_max() they are with reference to max available performance as documented in the intel_pstate documentation (in this case max non turbo P-State). This needs similar change as done in intel_cpufreq_verify_policy() for passive mode. Set policy->cpuinfo.max_freq based on the turbo status. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-02-04 00:05:32 +01:00
Rafael J. Wysocki	fb1fe1041c	cpufreq: intel_pstate: Operation mode control from sysfs Make it possible to change the operation mode of intel_pstate with the help of a new sysfs attribute called "status". There are three possible configurations that can be selected using this attribute: "off" - The driver is not in use at this time. "active" - The driver works as a P-state governor (default). "passive" - The driver works as a regular cpufreq one and collaborates with the generic cpufreq governors (it sets P-states as requested by those governors). [This is the same mode the driver can be started in by passing intel_pstate=passive in the kernel command line.] The current setting is returned by reads from this attribute. Writing one of the above strings to it changes the operation mode as indicated by that string, if possible. If HW-managed P-states (HWP) feature is enabled, it is not possible to change the driver's operation mode and attempts to write to this attribute will fail. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-02-04 00:05:31 +01:00
Rafael J. Wysocki	0c30b65b3c	cpufreq: intel_pstate: Expose global sysfs attributes upfront Expose the intel_pstate's global sysfs attributes before registering the driver to prepare for the addition of an attribute that also will have to work if the driver is not registered. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-02-04 00:05:30 +01:00
Rafael J. Wysocki	ff7e593c9c	Merge branches 'pm-sleep' and 'pm-cpufreq' * pm-sleep: Revert "PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag" * pm-cpufreq: cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy	2017-01-27 00:08:59 +01:00
Srinivas Pandruvada	1443ebbacf	cpufreq: intel_pstate: Fix sysfs limits enforcement for performance policy A side effect of keeping intel_pstate sysfs limits in sync with cpufreq is that the now sysfs limits can't enforced under performance policy. For example, if the max_perf_pct is changed from 100 to 80, this will call intel_pstate_set_policy(), which will change the max_perf to 100 again for performance policy. Same issue happens, when no_turbo is set. This change calculates max and min frequency using sysfs performance limits in intel_pstate_verify_policy() and adjusts policy limits by calling cpufreq_verify_within_limits(). Also, it causes the setting of performance limits to be skipped if no_turbo is set. Fixes: `111b8b3fe4` (cpufreq: intel_pstate: Always keep all limits settings in sync) Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-01-20 03:35:27 +01:00
Rafael J. Wysocki	3baad65546	Merge branch 'pm-cpufreq' * pm-cpufreq: cpufreq: dt: Add support for APM X-Gene 2 cpufreq: intel_pstate: Always keep all limits settings in sync cpufreq: intel_pstate: Use locking in intel_cpufreq_verify_policy() cpufreq: intel_pstate: Use locking in intel_pstate_resume() cpufreq: intel_pstate: Do not expose PID parameters in passive mode	2017-01-06 14:34:52 +01:00
Rafael J. Wysocki	111b8b3fe4	cpufreq: intel_pstate: Always keep all limits settings in sync Make intel_pstate update per-logical-CPU limits when the global settings are changed to ensure that they are always in sync and users will not see confusing values in per-logical-CPU sysfs attributes. This also fixes the problem that setting the "no_turbo" global attribute to 1 in the "passive" mode (ie. when intel_pstate acts as a regular cpufreq driver) when scaling_governor is set to "performance" has no effect. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-12-31 21:48:44 +01:00
Rafael J. Wysocki	cad3046796	cpufreq: intel_pstate: Use locking in intel_cpufreq_verify_policy() Race conditions are possible if intel_cpufreq_verify_policy() is executed in parallel with global limits updates from sysfs, so the invocation of intel_pstate_update_perf_limits() in it should be carried out under intel_pstate_limits_lock. Make that happen. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-12-31 21:48:43 +01:00
Rafael J. Wysocki	aa439248ab	cpufreq: intel_pstate: Use locking in intel_pstate_resume() Theoretically, intel_pstate_resume() may be executed in parallel with intel_pstate_set_policy(), if the latter is invoked via cpufreq_update_policy() as a result of a notification, so use intel_pstate_limits_lock in there too to avoid race conditions. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-12-31 21:48:42 +01:00
Rafael J. Wysocki	366430b5c2	cpufreq: intel_pstate: Do not expose PID parameters in passive mode If intel_pstate works in the passive mode in which it acts as a regular cpufreq driver and collaborates with generic cpufreq governors, the PID parameters are not used, so do not expose them via debugfs in that case. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-12-27 03:30:11 +01:00
Linus Torvalds	7b9dc3f75f	Power management material for v4.10-rc1 - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding for it (Markus Mayer). - Support for ARM Integrator/AP and Integrator/CP in the generic DT cpufreq driver and elimination of the old Integrator cpufreq driver (Linus Walleij). - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier, and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie, Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik). - cpufreq core fix to eliminate races that may lead to using inactive policy objects and related cleanups (Rafael Wysocki). - cpufreq schedutil governor update to make it use SCHED_FIFO kernel threads (instead of regular workqueues) for doing delayed work (to reduce the response latency in some cases) and related cleanups (Viresh Kumar). - New cpufreq sysfs attribute for resetting statistics (Markus Mayer). - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis, Viresh Kumar). - Support for using generic cpufreq governors in the intel_pstate driver (Rafael Wysocki). - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy Performance Preference/Energy Performance Bias) knobs in the intel_pstate driver (Srinivas Pandruvada). - New CPU ID for Knights Mill in intel_pstate (Piotr Luc). - intel_pstate driver modification to use the P-state selection algorithm based on CPU load on platforms with the system profile in the ACPI tables set to "mobile" (Srinivas Pandruvada). - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki, Srinivas Pandruvada). - cpufreq powernv driver updates including fast switching support (for the schedutil governor), fixes and cleanus (Akshay Adiga, Andrew Donnellan, Denis Kirjanov). - acpi-cpufreq driver rework to switch it over to the new CPU offline/online state machine (Sebastian Andrzej Siewior). - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth Prakash). - Idle injection rework (to make it use the regular idle path instead of a home-grown custom one) and related powerclamp thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej Siewior). - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy Shevchenko, Piotr Luc). - intel_idle driver cleanups and switch over to using the new CPU offline/online state machine (Anna-Maria Gleixner, Sebastian Andrzej Siewior). - cpuidle DT driver update to support suspend-to-idle properly (Sudeep Holla). - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian, Rafael Wysocki). - Preliminary support for power domains including CPUs in the generic power domains (genpd) framework and related DT bindings (Lina Iyer). - Assorted fixes and cleanups in the generic power domains (genpd) framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven). - Preliminary support for devices with multiple voltage regulators and related fixes and cleanups in the Operating Performance Points (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd). - System sleep state selection interface rework to make it easier to support suspend-to-idle as the default system suspend method (Rafael Wysocki). - PM core fixes and cleanups, mostly related to the interactions between the system suspend and runtime PM frameworks (Ulf Hansson, Sahitya Tummala, Tony Lindgren). - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski). - New Knights Mill CPU ID for the Intel RAPL power capping driver (Piotr Luc). - Intel RAPL power capping driver fixes, cleanups and switch over to using the new CPU offline/online state machine (Jacob Pan, Thomas Gleixner, Sebastian Andrzej Siewior). - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc, rockchip-dfi devfreq drivers and the devfreq core (Axel Lin, Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar). - Fix for false-positive KASAN warnings during resume from ACPI S3 (suspend-to-RAM) on x86 (Josh Poimboeuf). - Memory map verification during resume from hibernation on x86 to ensure a consistent address space layout (Chen Yu). - Wakeup sources debugging enhancement (Xing Wei). - rockchip-io AVS driver cleanup (Shawn Lin). -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJYTx4+AAoJEILEb/54YlRx9f8P/2SlNHUENW5qh6FtCw00oC2u UqJerQJ2L38UgbgxbE/0VYblma9rFABDWC1eO2xN2XdcdW5UPBKPVvNcOgNe1Clh gjy3RxZXVpmjfzt2kGfsTLEuGnHqwvx51hTUkeA2LwvkOal45xb8ZESmy8opCtiv iG4LwmPHoxdX5Za5nA9ItFKzxyO1EoyNSnBYAVwALDHxmNOfxEcRevfurASt/0M9 brCCZJA0/sZxeL0lBdy8fNQPIBTUfCoTJG/MtmzGrObJ9wMFvEDfXrVEyZiWs/zA AAZ4kQL77enrIKgrLN8e0G6LzTLHoVcvn38Xjf24dKUqhd7ACBhYcnW+jK3+7EAd gjZ8efObQsiuyK/EDLUNw35tt96CHOqfrQCj2tIwRVvk9EekLqAGXdIndTCr2kYW RpefmP5kMljnm/nQFOVLwMEUQMuVkvUE7EgxADy7DoDmepBFC4ICRDWPye70R2kC 0O1Tn2PAQq4Fd1tyI9TYYz0YQQkRoaRb5rfYUSzbRbeCdsphUopp4Vhsiyn6IcnF XnLbg6pRAat82MoS9n4pfO/VCo8vkErKA8tut9G7TDakkrJoEE7l31PdKW0hP3f6 sBo6xXy6WTeivU/o/i8TbM6K4mA37pBaj78ooIkWLgg5fzRaS2+0xSPVy2H9x1m5 LymHcobCK9rSZ1l208Fe =vhxI -----END PGP SIGNATURE----- Merge tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Again, cpufreq gets more changes than the other parts this time (one new driver, one old driver less, a bunch of enhancements of the existing code, new CPU IDs, fixes, cleanups) There also are some changes in cpuidle (idle injection rework, a couple of new CPU IDs, online/offline rework in intel_idle, fixes and cleanups), in the generic power domains framework (mostly related to supporting power domains containing CPUs), and in the Operating Performance Points (OPP) library (mostly related to supporting devices with multiple voltage regulators) In addition to that, the system sleep state selection interface is modified to make it easier for distributions with unchanged user space to support suspend-to-idle as the default system suspend method, some issues are fixed in the PM core, the latency tolerance PM QoS framework is improved a bit, the Intel RAPL power capping driver is cleaned up and there are some fixes and cleanups in the devfreq subsystem Specifics: - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding for it (Markus Mayer) - Support for ARM Integrator/AP and Integrator/CP in the generic DT cpufreq driver and elimination of the old Integrator cpufreq driver (Linus Walleij) - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier, and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie, Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik) - cpufreq core fix to eliminate races that may lead to using inactive policy objects and related cleanups (Rafael Wysocki) - cpufreq schedutil governor update to make it use SCHED_FIFO kernel threads (instead of regular workqueues) for doing delayed work (to reduce the response latency in some cases) and related cleanups (Viresh Kumar) - New cpufreq sysfs attribute for resetting statistics (Markus Mayer) - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis, Viresh Kumar) - Support for using generic cpufreq governors in the intel_pstate driver (Rafael Wysocki) - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy Performance Preference/Energy Performance Bias) knobs in the intel_pstate driver (Srinivas Pandruvada) - New CPU ID for Knights Mill in intel_pstate (Piotr Luc) - intel_pstate driver modification to use the P-state selection algorithm based on CPU load on platforms with the system profile in the ACPI tables set to "mobile" (Srinivas Pandruvada) - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki, Srinivas Pandruvada) - cpufreq powernv driver updates including fast switching support (for the schedutil governor), fixes and cleanus (Akshay Adiga, Andrew Donnellan, Denis Kirjanov) - acpi-cpufreq driver rework to switch it over to the new CPU offline/online state machine (Sebastian Andrzej Siewior) - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth Prakash) - Idle injection rework (to make it use the regular idle path instead of a home-grown custom one) and related powerclamp thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej Siewior) - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy Shevchenko, Piotr Luc) - intel_idle driver cleanups and switch over to using the new CPU offline/online state machine (Anna-Maria Gleixner, Sebastian Andrzej Siewior) - cpuidle DT driver update to support suspend-to-idle properly (Sudeep Holla) - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian, Rafael Wysocki) - Preliminary support for power domains including CPUs in the generic power domains (genpd) framework and related DT bindings (Lina Iyer) - Assorted fixes and cleanups in the generic power domains (genpd) framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven) - Preliminary support for devices with multiple voltage regulators and related fixes and cleanups in the Operating Performance Points (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd) - System sleep state selection interface rework to make it easier to support suspend-to-idle as the default system suspend method (Rafael Wysocki) - PM core fixes and cleanups, mostly related to the interactions between the system suspend and runtime PM frameworks (Ulf Hansson, Sahitya Tummala, Tony Lindgren) - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski) - New Knights Mill CPU ID for the Intel RAPL power capping driver (Piotr Luc) - Intel RAPL power capping driver fixes, cleanups and switch over to using the new CPU offline/online state machine (Jacob Pan, Thomas Gleixner, Sebastian Andrzej Siewior) - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc, rockchip-dfi devfreq drivers and the devfreq core (Axel Lin, Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar) - Fix for false-positive KASAN warnings during resume from ACPI S3 (suspend-to-RAM) on x86 (Josh Poimboeuf) - Memory map verification during resume from hibernation on x86 to ensure a consistent address space layout (Chen Yu) - Wakeup sources debugging enhancement (Xing Wei) - rockchip-io AVS driver cleanup (Shawn Lin)" * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits) devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks devfreq: rk3399_dmc: Remove dangling rcu_read_unlock() devfreq: exynos: Don't use OPP structures outside of RCU locks Documentation: intel_pstate: Document HWP energy/performance hints cpufreq: intel_pstate: Support for energy performance hints with HWP cpufreq: intel_pstate: Add locking around HWP requests PM / sleep: Print active wakeup sources when blocking on wakeup_count reads PM / core: Fix bug in the error handling of async suspend PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend PM / Domains: Fix compatible for domain idle state PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators() PM / OPP: Allow platform specific custom set_opp() callbacks PM / OPP: Separate out _generic_set_opp() PM / OPP: Add infrastructure to manage multiple regulators PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage() PM / OPP: Manage supply's voltage/current in a separate structure PM / OPP: Don't use OPP structure outside of rcu protected section PM / OPP: Reword binding supporting multiple regulators per device PM / OPP: Fix incorrect cpu-supply property in binding cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state() ..	2016-12-13 10:41:53 -08:00
Srinivas Pandruvada	984edbdccc	cpufreq: intel_pstate: Support for energy performance hints with HWP It is possible to provide hints to the HWP algorithms in the processor to be more performance centric to more energy centric. These hints are provided by using HWP energy performance preference (EPP) or energy performance bias (EPB) settings. The scope of these settings is per logical processor, which means that each of the logical processors in the package can be programmed with a different value. This change provides cpufreq sysfs interface to provide hint. For each policy, two additional attributes will be available to check and provide hint. These attributes will only be present when the intel_pstate driver is using HWP mode. These attributes are: - energy_performance_available_preferences - energy_performance_preference To get list of supported hints: $ cat energy_performance_available_preferences default performance balance_performance balance_power power The current preference can be read or changed via cpufreq sysfs attribute "energy_performance_preference". Reading from this attribute will display current effective setting changed via any method. User can write any of the valid preference string to this attribute. User can always restore to power-on default by writing "default". Implementation Since these hints can be provided by direct MSR write or using some tools like x86_energy_perf_policy, the driver internally doesn't maintain any state. The user operation will result in direct read/write of MSR: 0x774 (HWP_REQUEST_MSR). Also driver use read modify write to update other fields in this MSR. Summary of changes: - struct cpudata field epp_saved is renamed to epp_powersave, as this stores the value to restore once policy is switched from performance to powersave to restore original powersave EPP value. - A new struct cpudata field epp_saved is used to store the raw MSR EPP/EPB value when a CPU goes offline or on suspend and restore on online/resume. This ensures that EPP value is restored to correct value irrespective of the means used to set. - EPP/EPB value ranges are fixed for each preference, which can be set for the cpufreq sysfs, so user request is mapped to/from this range. - New attributes are only added when HWP is present. - Since EPP value of 0 is valid the fields are initialized to -EINVAL when not valid. The field epp_default is read only once after powerup to avoid reading on subsequent CPU online operation - New suspend callback to store epp on suspend operation - Don't invalidate old epp_saved field on resume and online as now we can restore last epp value on suspend and this field can still have old EPP value sampled during switch to performance from powersave. - While here optimized setting of cpu_data->epp_powersave = epp in intel_pstate_hwp_set() as this was done in both true and false paths. - epp/epb set function returns error to caller on failure to pass on to user space for display. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-12-08 01:43:05 +01:00
Srinivas Pandruvada	b59fe54053	cpufreq: intel_pstate: Add locking around HWP requests To avoid race conditions from multiple threads, increase the scope of intel_pstate_limits_lock to include HWP requests also. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-12-08 01:43:04 +01:00
Piotr Luc	58bf454272	cpufreq: intel_pstate: Add Knights Mill CPUID Add Knights Mill (KNM) to the list of CPUIDs supported by intel_pstate. Signed-off-by: Piotr Luc <piotr.luc@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-12-01 15:08:12 +01:00
Arnd Bergmann	7a3ba767f6	cpufreq: intel_pstate: fix intel_pstate_exit_perf_limits() prototype The addition of the generic governor support marked the intel_pstate_exit_perf_limits as inline(), which fixed a warning, but it introduced another warning: drivers/cpufreq/intel_pstate.c: In function ‘intel_pstate_exit_perf_limits’: drivers/cpufreq/intel_pstate.c:483:1: error: no return statement in function returning non-void [-Werror=return-type] This changes it back to a 'void' return type, and changes the corresponding intel_pstate_init_acpi_perf_limits() function to be inline as well for consistency. Fixes: `001c76f05b` (cpufreq: intel_pstate: Generic governors support) Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-28 14:24:21 +01:00
Srinivas Pandruvada	8442885fca	cpufreq: intel_pstate: Set EPP/EPB to 0 in performance mode When user has selected performance policy, then set the EPP (Energy Performance Preference) or EPB (Energy Performance Bias) to maximum performance mode. Also when user switch back to powersave, then restore EPP/EPB to last EPP/EPB value before entering performance mode. If user has not changed EPP/EPB manually then it will be power on default value. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-28 14:23:56 +01:00
Rafael J. Wysocki	17669006ad	cpufreq/intel_pstate: Use CPPC to get max performance Use the acpi cppc_lib interface to get CPPC performance limits and update the per cpu priority for the ITMT scheduler. If the highest performance of CPUs differs the ITMT feature is enabled. Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: linux-pm@vger.kernel.org Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/0998b98943bcdec7d1ddd4ff27358da555ea8e92.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-11-24 20:44:20 +01:00
Srinivas Pandruvada	d5dd33d9de	cpufreq: intel_pstate: increase precision of performance limits Even with round up of limits->min_perf and limits->max_perf, in some cases resultant performance is 100 MHz less than the desired. For example when the maximum frequency is 3.50 GHz, setting scaling_min_frequency to 2.3 GHz always results in 2.2 GHz minimum. Currently the fixed floating point operation uses 8 bit precision for calculating limits->min_perf and limits->max_perf. For some operations in this driver the 14 bit precision is used. Using the 14 bit precision also for calculating limits->min_perf and limits->max_perf, addresses this issue. Introduced fp_ext_toint() equivalent to fp_toint() and int_ext_tofp() equivalent to int_tofp() with 14 bit precision. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-22 02:31:49 +01:00
Srinivas Pandruvada	46992d6b55	cpufreq: intel_pstate: round up min_perf limits In some use cases, user wants to enforce a minimum performance limit on CPUs. But because of simple division the resultant performance is 100 MHz less than the desired in some cases. For example when the maximum frequency is 3.50 GHz, setting scaling_min_frequency to 1.6 GHz always results in 1.5 GHz minimum. With simple round up, the frequency can be set to 1.6 GHz to minimum in this case. This round up is already done to max_policy_pct and max_perf, so do the same for min_policy_pct and min_perf. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-22 02:31:48 +01:00
Rafael J. Wysocki	001c76f05b	cpufreq: intel_pstate: Generic governors support There may be reasons to use generic cpufreq governors (eg. schedutil) on Intel platforms instead of the intel_pstate driver's internal governor. However, that currently can only be done by disabling intel_pstate altogether and using the acpi-cpufreq driver instead of it, which is subject to limitations. First of all, acpi-cpufreq only works on systems where the _PSS object is present in the ACPI tables for all logical CPUs. Second, on those systems acpi-cpufreq will only use frequencies listed by _PSS which may be suboptimal. In particular, by convention, the whole turbo range is represented in _PSS as a single P-state and the frequency assigned to it is greater by 1 MHz than the greatest non-turbo frequency listed by _PSS. That may confuse governors to use turbo frequencies less frequently which may lead to suboptimal performance. For this reason, make it possible to use the intel_pstate driver with generic cpufreq governors as a "normal" cpufreq driver. That mode is enforced by adding intel_pstate=passive to the kernel command line and cannot be disabled at run time. In that mode, intel_pstate provides a cpufreq driver interface including the ->target() and ->fast_switch() callbacks and is listed in scaling_driver as "intel_cpufreq". Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Doug Smythies <dsmythies@telus.net>	2016-11-21 14:32:32 +01:00
Rafael J. Wysocki	d0ea59e188	cpufreq: intel_pstate: Request P-states control from SMM if needed Currently, intel_pstate is unable to control P-states on my IvyBridge-based Acer Aspire S5, because they are controlled by SMM on that machine by default and it is necessary to request OS control of P-states from it via the SMI Command register exposed in the ACPI FADT. intel_pstate doesn't do that now, but acpi-cpufreq and other cpufreq drivers for x86 platforms do. Address this problem by making intel_pstate use the ACPI-defined mechanism as well. However, intel_pstate is not modular and it doesn't need the module refcount tricks played by acpi_processor_notify_smm(), so export the core of this function to it as acpi_processor_pstate_control() and make it call that. [The changes in processor_perflib.c related to this should not make any functional difference for the acpi_processor_notify_smm() users]. To be safe, only call acpi_processor_notify_smm() from intel_pstate if ACPI _PPC support is enabled in it. Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-11-17 22:47:47 +01:00
Srinivas Pandruvada	7f7a516ee3	cpufreq: intel_pstate: Use CPU load based algorithm for PM_MOBILE Use get_target_pstate_use_cpu_load() to calculate target P-State for devices, with the preferred power management profile in ACPI FADT set to PM_MOBILE. This may help in resolving some thermal issues caused by low sustained cpu bound workloads. The current algorithm tend to over provision in this case as it doesn't look at the CPU busyness. Also included the fix from Arnd Bergmann <arnd@arndb.de> to solve compile issue, when CONFIG_ACPI is not defined. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-14 21:25:23 +01:00
Srinivas Pandruvada	a410c03d66	cpufreq: intel_pstate: protect limits variable The limits variable gets modified from intel_pstate sysfs and also gets modified from cpufreq sysfs. So protect with a mutex to keep data integrity, when they are getting modified from multiple threads. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-01 06:10:54 +01:00
Srinivas Pandruvada	5879f87739	cpufreq: intel_pstate: Reduce impact due to rounding error When policy->max and policy->min are same, in some cases they don't result in the same frequency cap. The max_policy_pct is rounded up but not min_perf_pct. So even when they are same, results in different percentage or maximum and minimum. Since minimum is a conservative value for power, a lower value without rounding is better in most of the cases, unless user wants policy->max = policy->min. This change uses use the same policy percentage when policy->max and policy->min are same. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-01 06:04:06 +01:00
Srinivas Pandruvada	eae48f046f	cpufreq: intel_pstate: Per CPU P-State limits Intel P-State offers two interface to set performance limits: - Intel P-State sysfs /sys/devices/system/cpu/intel_pstate/max_perf_pct /sys/devices/system/cpu/intel_pstate/min_perf_pct - cpufreq /sys/devices/system/cpu/cpu/cpufreq/scaling_max_freq /sys/devices/system/cpu/cpu/cpufreq/scaling_min_freq In the current implementation both of the above methods, change limits to every CPU in the system. Moreover the limits placed using cpufreq policy interface also presented in the Intel P-State sysfs via modified max_perf_pct and min_per_pct during sysfs reads. This allows to check percent of reduced/increased performance, irrespective of method used to limit. There are some new generations of processors, where it is possible to have limits placed on individual CPU cores. Using cpufreq interface it is possible to set limits on each CPU. But the current processing will use last limits placed on all CPUs. So the per core limit feature of CPUs can't be used. This change brings in capability to set P-States limits for each CPU, with some limitations. In this case what should be the read of max_perf_pct and min_perf_pct? It can be most restrictive limits placed on any CPU or max possible performance on any given CPU on which no limits are placed. In either case someone will have issue. So the consensus is, we can't have both sysfs controls present when user wants to use limit per core limits. - By default per-core-control feature is not enabled. So no one will notice any difference. - The way to enable is by kernel command line intel_pstate=per_cpu_perf_limits - When the per-core-controls are enabled there is no display of for both read and write on /sys/devices/system/cpu/intel_pstate/max_perf_pct /sys/devices/system/cpu/intel_pstate/min_perf_pct - User can change limits using /sys/devices/system/cpu/cpu/cpufreq/scaling_max_freq /sys/devices/system/cpu/cpu/cpufreq/scaling_min_freq /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor - User can still observe turbo percent and number of P-States from /sys/devices/system/cpu/intel_pstate/turbo_pct /sys/devices/system/cpu/intel_pstate/num_pstates - User can read write system wide turbo status /sys/devices/system/cpu/no_turbo While changing this BUG_ON is changed to WARN_ON, as they are not fatal errors for the system. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-01 06:04:06 +01:00
Rafael J. Wysocki	fe0f59c412	Merge back earlier cpufreq material for v4.10.	2016-10-30 06:12:50 +01:00
Rafael J. Wysocki	2f1d407ada	cpufreq: intel_pstate: Always set max P-state in performance mode The only times at which intel_pstate checks the policy set for a given CPU is the initialization of that CPU and updates of its policy settings from cpufreq when intel_pstate_set_policy() is invoked. That is insufficient, however, because intel_pstate uses the same P-state selection function for all CPUs regardless of the policy setting for each of them and the P-state limits are shared between them. Thus if the policy is set to "performance" for a particular CPU, it may not behave as expected if the cpufreq settings are changed subsequently for another CPU. That can be easily demonstrated by writing "performance" to scaling_governor for all CPUs and then switching it to "powersave" for one of them in which case all of the CPUs will behave as though their scaling_governor were all "powersave" (even though the policy still appears to be "performance" for the remaining CPUs). Fix this problem by modifying intel_pstate_adjust_busy_pstate() to always set the P-state to the maximum allowed by the current limits for all CPUs whose policy is set to "performance". Note that it still is recommended to always change the policy setting in the same way for all CPUs even with this fix applied to avoid confusion. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-10-24 23:20:25 +02:00
Rafael J. Wysocki	a6c6ead141	cpufreq: intel_pstate: Set P-state upfront in performance mode After commit `a4675fbc4a` (cpufreq: intel_pstate: Replace timers with utilization update callbacks) the cpufreq governor callbacks may not be invoked on NOHZ_FULL CPUs and, in particular, switching to the "performance" policy via sysfs may not have any effect on them. That is a problem, because it usually is desirable to squeeze the last bit of performance out of those CPUs, so work around it by setting the maximum P-state (within the limits) in intel_pstate_set_policy() upfront when the policy is CPUFREQ_POLICY_PERFORMANCE. Fixes: `a4675fbc4a` (cpufreq: intel_pstate: Replace timers with utilization update callbacks) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-10-21 22:18:22 +02:00
Srinivas Pandruvada	185d82456e	cpufreq: intel_pstate: Remove PID debugfs when not used When target state is calculated using get_target_pstate_use_cpu_load(), PID controller is not used, hence it has no effect on performance. So don't present debugfs entries to tune PID controller. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-10-21 22:16:26 +02:00
Rafael J. Wysocki	1d29815ef2	cpufreq: intel_pstate: Drop boost_iowait flag The "IOwait boosting" mechanism is only used by the get_target_pstate_use_cpu_load() governor function and the boost_iowait flag in pid_params is always set when that function is in use (and it is never set otherwise). This means that the boost_iowait flag is in fact redundant and may be dropped. For this reason, replace the boost_iowait flag check in intel_pstate_update_util() with an equivalent check against pstate_funcs.get_target_pstate and drop that flag. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-10-21 22:13:51 +02:00
Rafael J. Wysocki	3954517e2f	cpufreq: intel_pstate: Fix struct pstate_adjust_policy kerneldoc It looks like the name of struct pstate_adjust_policy was updated without updating its kerneldoc comment accordingly, so fix that mistake. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-10-12 20:58:14 +02:00
Rafael J. Wysocki	0843e83c1a	cpufreq: intel_pstate: Proportional algorithm for Atom The PID algorithm used by the intel_pstate driver tends to drive performance to the minimum for workloads with utilization below the setpoint, which is undesirable, so replace it with a modified "proportional" algorithm on Atom. The new algorithm will set the new P-state to be 1.25 times the available maximum times the (frequency-invariant) utilization during the previous sampling period except when the target P-state computed this way is lower than the average P-state during the previous sampling period. In the latter case, it will increase the target by 50% of the difference between it and the average P-state to prevent performance from dropping down too fast in some cases. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-10-12 20:58:13 +02:00
Rafael J. Wysocki	f00593a4bd	cpufreq: intel_pstate: Clarify comment in get_target_pstate_use_performance() Make the comment explaining the meaning of the perf_scaled variable in get_target_pstate_use_performance() more straightforward. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-10-09 18:54:57 +02:00
Srinivas Pandruvada	f9f4872df6	cpufreq: intel_pstate: Fix unsafe HWP MSR access This is a requirement that MSR MSR_PM_ENABLE must be set to 0x01 before reading MSR_HWP_CAPABILITIES on a given CPU. If cpufreq init() is scheduled on a CPU which is not same as policy->cpu or migrates to a different CPU before calling msr read for MSR_HWP_CAPABILITIES, it is possible that MSR_PM_ENABLE was not to set to 0x01 on that CPU. This will cause GP fault. So like other places in this path rdmsrl_on_cpu should be used instead of rdmsrl. Moreover the scope of MSR_HWP_CAPABILITIES is on per thread basis, so it should be read from the same CPU, for which MSR MSR_HWP_REQUEST is getting set. dmesg dump or warning: [ 22.014488] WARNING: CPU: 139 PID: 1 at arch/x86/mm/extable.c:50 ex_handler_rdmsr_unsafe+0x68/0x70 [ 22.014492] unchecked MSR access error: RDMSR from 0x771 [ 22.014493] Modules linked in: [ 22.014507] CPU: 139 PID: 1 Comm: swapper/0 Not tainted 4.7.5+ #1 ... ... [ 22.014516] Call Trace: [ 22.014542] [<ffffffff813d7dd1>] dump_stack+0x63/0x82 [ 22.014558] [<ffffffff8107bc8b>] __warn+0xcb/0xf0 [ 22.014561] [<ffffffff8107bcff>] warn_slowpath_fmt+0x4f/0x60 [ 22.014563] [<ffffffff810676f8>] ex_handler_rdmsr_unsafe+0x68/0x70 [ 22.014564] [<ffffffff810677d9>] fixup_exception+0x39/0x50 [ 22.014604] [<ffffffff8102e400>] do_general_protection+0x80/0x150 [ 22.014610] [<ffffffff817f9ec8>] general_protection+0x28/0x30 [ 22.014635] [<ffffffff81687940>] ? get_target_pstate_use_performance+0xb0/0xb0 [ 22.014642] [<ffffffff810600c7>] ? native_read_msr+0x7/0x40 [ 22.014657] [<ffffffff81688123>] intel_pstate_hwp_set+0x23/0x130 [ 22.014660] [<ffffffff81688406>] intel_pstate_set_policy+0x1b6/0x340 [ 22.014662] [<ffffffff816829bb>] cpufreq_set_policy+0xeb/0x2c0 [ 22.014664] [<ffffffff81682f39>] cpufreq_init_policy+0x79/0xe0 [ 22.014666] [<ffffffff81682cb0>] ? cpufreq_update_policy+0x120/0x120 [ 22.014669] [<ffffffff816833a6>] cpufreq_online+0x406/0x820 [ 22.014671] [<ffffffff8168381f>] cpufreq_add_dev+0x5f/0x90 [ 22.014717] [<ffffffff81530ac8>] subsys_interface_register+0xb8/0x100 [ 22.014719] [<ffffffff816821bc>] cpufreq_register_driver+0x14c/0x210 [ 22.014749] [<ffffffff81fe1d90>] intel_pstate_init+0x39d/0x4d5 [ 22.014751] [<ffffffff81fe13f2>] ? cpufreq_gov_dbs_init+0x12/0x12 Cc: 4.3+ <stable@vger.kernel.org> # 4.3+ Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-10-09 18:54:13 +02:00
Rafael J. Wysocki	b6e2511782	Merge branch 'pm-cpufreq-sched' into pm-cpufreq	2016-10-02 01:42:33 +02:00
Srinivas Pandruvada	3ba7bcaa36	cpufreq: intel_pstate: Add io_boost trace Add io_boost percent to current pstate_sample tracepoint. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-09-16 23:55:30 +02:00
Rafael J. Wysocki	09c448d3c6	cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm Modify the P-state selection algorithm for Atom processors to use the new SCHED_CPUFREQ_IOWAIT flag instead of the questionable get_cpu_iowait_time_us() function. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-09-14 02:28:13 +02:00
Julia Lawall	42ce8921cc	intel_pstate: constify local structures For structure types defined in the same file or local header files, find top-level static structure declarations that have the following properties: 1. Never reassigned. 2. Address never taken 3. Not passed to a top-level macro call 4. No pointer or array-typed field passed to a function or stored in a variable. Declare structures having all of these properties as const. Done using Coccinelle. Based on a suggestion by Joe Perches <joe@perches.com>. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-09-13 02:40:24 +02:00
Rafael J. Wysocki	58919e83c8	cpufreq / sched: Pass flags to cpufreq_update_util() It is useful to know the reason why cpufreq_update_util() has just been called and that can be passed as flags to cpufreq_update_util() and to the ->func() callback in struct update_util_data. However, doing that in addition to passing the util and max arguments they already take would be clumsy, so avoid it. Instead, use the observation that the schedutil governor is part of the scheduler proper, so it can access scheduler data directly. This allows the util and max arguments of cpufreq_update_util() and the ->func() callback in struct update_util_data to be replaced with a flags one, but schedutil has to be modified to follow. Thus make the schedutil governor obtain the CFS utilization information from the scheduler and use the "RT" and "DL" flags instead of the special utilization value of ULONG_MAX to track updates from the RT and DL sched classes. Make it non-modular too to avoid having to export scheduler variables to modules at large. Next, update all of the other users of cpufreq_update_util() and the ->func() callback in struct update_util_data accordingly. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-08-16 22:14:55 +02:00
Rafael J. Wysocki	e2b3b80de5	Merge branches 'pm-sleep', 'pm-cpufreq', 'pm-core' and 'pm-opp' * pm-sleep: x86/power/64: Do not refer to __PAGE_OFFSET from assembly code * pm-cpufreq: cpufreq: Do not default-yes CPU_FREQ_STAT cpufreq: intel_pstate: Add more out-of-band IDs * pm-core: PM-wakeup: Delete unnecessary checks before three function calls * pm-opp: PM / OPP: optimize dev_pm_opp_set_rate() performance a bit	2016-08-05 15:46:55 +02:00
Srinivas Pandruvada	65c1262f40	cpufreq: intel_pstate: Add more out-of-band IDs Add Skylake-X and Broadwell-X IDs for out-of-band (OBB) control of P-States. For these processors, if MSR_MISC_PWR_MGMT BIT(8) == 1, then the Intel P-State driver should exit as OS can't control P-States. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw : Subject/changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-07-28 23:58:17 +02:00
Rafael J. Wysocki	bc841e260c	Merge branch 'pm-cpu' * pm-cpu: x86: remove duplicate turbo ratio limit MSRs tools/power turbostat: Replace MSR_NHM_TURBO_RATIO_LIMIT cpufreq: intel_pstate: Replace MSR_NHM_TURBO_RATIO_LIMIT	2016-07-25 13:46:30 +02:00
Srinivas Pandruvada	da7de91c3e	cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT The MSR MSR_HWP_INTERRUPT is valid only when CPUID.06H:EAX[8] = 1, so check for feature before accessing this MSR. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-07-21 14:29:30 +02:00
Rafael J. Wysocki	bc95a454b6	intel_pstate: Update cpu_frequency tracepoint every time Currently, intel_pstate only updates the cpu_frequency tracepoint if the new P-state to set is different from the current one, but that causes powertop to report 100% idle on an 100% loaded system sometimes. Prevent that from happening by updating the cpu_frequency tracepoint every time intel_pstate_update_pstate() is called. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>-	2016-07-21 14:28:37 +02:00
Carsten Emde	2630abc243	cpufreq: intel_pstate: clean remnant struct element When I was working with the Intel P state driver I came across a remnant struct element that is no longer needed after the function intel_pstate_calc_freq() was retired. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-07-21 14:26:00 +02:00
Jan Kiszka	5fc8f707a2	intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate() If MSR_CONFIG_TDP_CONTROL is locked, we currently try to address some MSR 0x80000648 or so. Mask out the relevant level bits 0 and 1. Found while running over the Jailhouse hypervisor which became upset about this strange MSR index. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: 4.4+ <stable@vger.kernel.org> # 4.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-07-11 15:12:30 +02:00
Srinivas Pandruvada	100cf6f277	cpufreq: intel_pstate: Replace MSR_NHM_TURBO_RATIO_LIMIT Replace MSR_NHM_TURBO_RATIO_LIMIT with MSR_TURBO_RATIO_LIMIT. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-07-07 15:31:58 +02:00
Rafael J. Wysocki	5d1191ab6c	Merge back earlier cpufreq material for v4.8.	2016-07-04 13:21:43 +02:00
Jisheng Zhang	4a7cb7a96a	intel_pstate: Declare pid_params/pstate_funcs/hwp_active __read_mostly pid_params is written once by copy_pid_params() during initialization, and thereafter is mostly read by hot path intel_pstate_update_util(). The read of pid_params gets more after commit `a4675fbc4a` ("cpufreq: intel_pstate: Replace timers with utilization update callbacks") pstate_funcs is written once by copy_cpu_funcs() during initialization, and thereafter is mostly read by hot path intel_pstate_update_util() hwp_active is written to once during initialization and thereafter is mostly read by hot path intel_pstate_update_util(). The fact that they are mostly read and not written to makes them candidates for __read_mostly declarations. Signed-off-by: Jisheng Zhang <jszhang@marvell.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-28 00:04:04 +02:00
Jisheng Zhang	29327c84ba	intel_pstate: add __init/__initdata marker to some functions/variables These functions/variables are not needed after booting, so mark them as __init or __initdata. Signed-off-by: Jisheng Zhang <jszhang@marvell.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-28 00:04:04 +02:00
Jisheng Zhang	eed436095e	intel_pstate: Fix incorrect placement of __initdata __initdata should be placed between the variable name and equal sign (if there is) for the variable to be placed in the intended section. Signed-off-by: Jisheng Zhang <jszhang@marvell.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-28 00:04:04 +02:00
Rafael J. Wysocki	5ab666e095	intel_pstate: Do not clear utilization update hooks on policy changes intel_pstate_set_policy() is invoked by the cpufreq core during driver initialization, on changes of policy attributes (minimim and maximum frequency, for example) via sysfs and via CPU notifications from the platform firmware. On some platforms the latter may occur relatively often. Commit `bb6ab52f2b` (intel_pstate: Do not set utilization update hook too early) made intel_pstate_set_policy() clear the CPU's utilization update hook before updating the policy attributes for it (and set the hook again after doind that), but that involves invoking synchronize_sched() and adds overhead to the CPU notifications mentioned above and to the sched-RCU handling in general. That extra overhead is arguably not necessary, because updating policy attributes when the CPU's utilization update hook is active should not lead to any adverse effects, so drop the clearing of the hook from intel_pstate_set_policy() and make it check if the hook has been set already when attempting to set it. Fixes: `bb6ab52f2b` (intel_pstate: Do not set utilization update hook too early) Reported-by: Jisheng Zhang <jszhang@marvell.com> Tested-by: Jisheng Zhang <jszhang@marvell.com> Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-27 23:47:15 +02:00
Rafael J. Wysocki	56b7808572	Merge back earlier cpufreq changes for v4.8.	2016-06-20 14:31:41 +02:00
Srinivas Pandruvada	b00345d199	cpufreq: intel_pstate: Adjust _PSS[0] freqeuency if needed The maximum turbo P-State used by the intel_pstate driver may be limited by ACPI _PSS table entry 0. After commit `9522a2ff9c` (cpufreq: intel_pstate: Enforce _PPC limits), the maximum performance on servers will be capped by the _PSS table entry 0 by default. Even though that is formally correct, it may lead to preformance regressions in some cases. Namely, if the _PSS table entry 0 is not the maximum turbo P-State, performance measured after commit `9522a2ff9c` will not match the performance measured before that commit on the same system. For this reason, modify the code to always use the maximum turbo frequency as the one that corresponds to _PSS table entry 0 if turbo is enabled in the BIOS. This way, the performance levels from before commit `9522a2ff9c` will be restored on the affected systems. Fixes: `9522a2ff9c` (cpufreq: intel_pstate: Enforce _PPC limits) Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw : Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-15 01:56:47 +02:00
Srinivas Pandruvada	41bad47f76	cpufreq: intel_pstate: Broxton support Add Broxton CPU model number. Broxton requires core_params to get performance limits via MSRs, but it is an Atom platform, which requires more power optimized algorithm. So the P state selection will use similar algorithm as other Atom platforms. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-13 23:49:39 +02:00
Rafael J. Wysocki	b77b565108	Merge branch 'x86/cpu' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into x86/cpu Pull recent changes related to x86 CPU model representations from tip.	2016-06-13 23:48:23 +02:00
Dave Hansen	5b20c94488	x86/cpufreq: Use Intel family name macros for the intel_pstate cpufreq driver Another straightforward replacement of magic numbers. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: jacob.jun.pan@intel.com Cc: linux-pm@vger.kernel.org Link: http://lkml.kernel.org/r/20160603001945.0F5D02AA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-06-08 13:03:26 +02:00
Srinivas Pandruvada	983e600e88	cpufreq: intel_pstate: Fix ->set_policy() interface for no_turbo When turbo is disabled, the ->set_policy() interface is broken. For example, when turbo is disabled and cpuinfo.max = `2900000` (full max turbo frequency), setting the limits results in frequency less than the requested one: Set 1000000 KHz results in 0700000 KHz Set 1500000 KHz results in 1100000 KHz Set 2000000 KHz results in 1500000 KHz This is because the limits->max_perf fraction is calculated using the max turbo frequency as the reference, but when the max P-State is capped in intel_pstate_get_min_max(), the reference is not the max turbo P-State. This results in reducing max P-State. One option is to always use max turbo as reference for calculating limits. But this will not be correct. By definition the intel_pstate sysfs limits, shows percentage of available performance. So when BIOS has disabled turbo, the available performance is max non turbo. So the max_perf_pct should still show 100%. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw : Subject & changelog, rewrite in fewer lines of code ] Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-08 03:22:40 +02:00
Srinivas Pandruvada	2c2c1af449	cpufreq: intel_pstate: Fix code ordering in intel_pstate_set_policy() The limits->max_perf is rounded_up but immediately overwritten by another assignment to limits->max_perf. Move that operation to the correct location. While here also added a pr_debug() call in ->set_policy to aid in debugging. Fixes: `785ee27881` (cpufreq: intel_pstate: Fix limits->max_perf rounding error) Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw : Subject & changelog ] Cc: 4.4+ <stable@vger.kernel.org> # 4.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-08 03:22:39 +02:00
Srinivas Pandruvada	6cacd115a8	cpufreq: intel_pstate: Downgrade print level for _PPC Downgrade pr_info to pr_debug for the "_PPC limits will be enforced" message. In server systems with many cores this message is annoying. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-05-30 15:22:02 +02:00
Rafael J. Wysocki	c749c64f45	intel_pstate: Simplify conditional in intel_pstate_set_policy() One of the if () statements in intel_pstate_set_policy() causes another if () to be evaluated if the condition is true and it doesn't do anything else, so merge the two if () statements into one. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-05-18 02:26:33 +02:00
Rafael J. Wysocki	1aa7a6e2b8	intel_pstate: Clean up get_target_pstate_use_performance() The comments and the core_busy variable name in get_target_pstate_use_performance() are totally confusing, so modify them to reflect what's going on. The results of the computations should be the same as before. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-05-11 22:58:38 +02:00
Rafael J. Wysocki	8edb0a6e48	intel_pstate: Use sample.core_avg_perf in get_avg_pstate() Notice that get_avg_pstate() can use sample.core_avg_perf instead of carrying the same division again, so make it do that. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-05-11 22:58:37 +02:00
Rafael J. Wysocki	a1c9787dc3	intel_pstate: Clarify average performance computation The core_pct_busy field of struct sample actually contains the average performace during the last sampling period (in percent) and not the utilization of the core as suggested by its name which is confusing. For this reason, change the name of that field to core_avg_perf and rename the function that computes its value accordingly. Also notice that storing this value as percentage requires a costly integer multiplication to be carried out in a hot path, so instead store it as an "extended fixed point" value with more fraction bits and update the code using it accordingly (it is better to change the name of the field along with its meaning in one go than to make those two changes separately, as that would likely lead to more confusion). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-05-11 22:58:37 +02:00
Chen Yu	4578ee7e1d	intel_pstate: Avoid unnecessary synchronize_sched() during initialization Currently, in intel_pstate_clear_update_util_hook(), after clearing the utilization update hook, we leverage synchronize_sched() to deal with synchronization, which is a little bit time-costly because synchronize_sched() has to wait for all the CPUs to go through a grace period. Actually, the synchronize_sched() is not necessary if the utilization update hook has not been set for the given CPU yet, so make the driver check if that's the case and avoid the synchronize_sched() call then. Link: https://bugzilla.kernel.org/show_bug.cgi?id=116371 Tested-by: Tian Ye <yex.tian@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> [ rjw : Rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-05-11 22:56:34 +02:00
Rafael J. Wysocki	f96fd0c86f	intel_pstate: Clean up intel_pstate_get() intel_pstate_get() contains a local variable that's initialized but never used and it can be written in fewer lines of code, so clean it up. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-05-09 22:55:36 +02:00
Rafael J. Wysocki	da43af961b	Merge cpufreq fixes going into v4.6. * pm-cpufreq-fixes: intel_pstate: Fix intel_pstate_get() cpufreq: intel_pstate: Fix HWP on boot CPU after system resume cpufreq: st: enable selective initialization based on the platform cpufreq: intel_pstate: Fix processing for turbo activation ratio	2016-05-06 22:01:14 +02:00
Srinivas Pandruvada	e59a8f7ff4	cpufreq: intel_pstate: Ignore _PPC processing under HWP When HWP (hardware P states) feature is active, the ACPI _PSS and _PPC is not used. So ignore processing for _PPC limits. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-05-05 01:43:47 +02:00
Rafael J. Wysocki	6d45b719cb	intel_pstate: Fix intel_pstate_get() After commit `8fa520af50` "intel_pstate: Remove freq calculation from intel_pstate_calc_busy()" intel_pstate_get() calls get_avg_frequency() to compute the average frequency, which is problematic for two reasons. First, intel_pstate_get() may be invoked before the driver reads the CPU feedback registers for the first time and if that happens, get_avg_frequency() will attempt to divide by zero. Second, the get_avg_frequency() call in intel_pstate_get() is racy with respect to intel_pstate_sample() and it may end up returning completely meaningless values for this reason. Moreover, after commit `7349ec0470` "intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()" sample.core_pct_busy is never computed on Atom, but it is used in intel_pstate_adjust_busy_pstate() in that case too. To address those problems notice that if sample.core_pct_busy was used in the average frequency computation carried out by get_avg_frequency(), both the divide by zero problem and the race with respect to intel_pstate_sample() would be avoided. Accordingly, move the invocation of intel_pstate_calc_busy() from get_target_pstate_use_performance() to intel_pstate_update_util(), which also will take care of the uninitialized sample.core_pct_busy on Atom, and modify get_avg_frequency() to use sample.core_pct_busy as per the above. Reported-by: kernel test robot <ying.huang@linux.intel.com> Link: http://marc.info/?l=linux-kernel&m=146226437623173&w=4 Fixes: `8fa520af50` "intel_pstate: Remove freq calculation from intel_pstate_calc_busy()" Fixes: `7349ec0470` "intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()" Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-05-04 14:09:16 +02:00
Rafael J. Wysocki	ba41e1bc28	cpufreq: intel_pstate: Fix HWP on boot CPU after system resume Commit `41cfd64cf4` "Update frequencies of policy->cpus only from ->set_policy()" changed the way the intel_pstate driver's ->set_policy callback updates the HWP (hardware-managed P-states) settings. A side effect of it is that if those settings are modified on the boot CPU during system suspend and wakeup, they will never be restored during subsequent system resume. To address this problem, allow cpufreq drivers that don't provide ->target or ->target_index callbacks to use ->suspend and ->resume callbacks and add a ->resume callback to intel_pstate to restore the HWP settings on the CPUs that belong to the given policy. Fixes: `41cfd64cf4` "Update frequencies of policy->cpus only from ->set_policy()" Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-05-02 13:48:15 +02:00
Srinivas Pandruvada	2b3ec76505	cpufreq: intel_pstate: Enable PPC enforcement for servers For platforms which are controlled via remove node manager, enable _PPC by default. These platforms are mostly categorized as enterprise server or performance servers. These platforms needs to go through some certifications tests, which tests control via _PPC. The relative risk of enabling by default is low as this is is less likely that these systems have broken _PSS table. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-28 01:01:39 +02:00
Srinivas Pandruvada	3be9200d51	cpufreq: intel_pstate: Adjust policy->max When policy->max is changed via _PPC or sysfs and is more than the max non turbo frequency, it does not really change resulting performance in some processors. When policy->max results in a P-State ratio more than the turbo activation ratio, then processor can choose any P-State up to max turbo. So the user or _PPC setting has no value, but this can cause undesirable side effects like: - Showing reduced max percentage in Intel P-State sysfs - It can cause reduced max performance under certain boundary conditions: The requested max scaling frequency either via _PPC or via cpufreq-sysfs, will be converted into a fixed floating point max percent scale. In majority of the cases this will result in correct max. But not 100% of the time. If the _PPC is requested at a point where the calculation lead to a lower max, this can result in a lower P-State then expected and it will impact performance. Example of this condition using a Broadwell laptop with config TDP. ACPI _PSS table from a Broadwell laptop `2301000` 2300000 2200000 2000000 1900000 1800000 1700000 1500000 1400000 1300000 1100000 1000000 900000 800000 600000 500000 The actual results by disabling config TDP so that we can get what is requested on or below 2300000Khz. scaling_max_freq Max Requested P-State Resultant scaling max ---------------------------------------- ---------------------- 2400000 18 `2900000` (max turbo) 2300000 17 2300000 (max physical non turbo) 2200000 15 2100000 2100000 15 2100000 2000000 13 1900000 1900000 13 1900000 1800000 12 1800000 1700000 11 1700000 1600000 10 1600000 1500000 f 1500000 1400000 e 1400000 1300000 d 1300000 1200000 c 1200000 1100000 a 1000000 1000000 a 1000000 900000 9 900000 800000 8 800000 700000 7 700000 600000 6 600000 500000 5 500000 ------------------------------------------------------------------ Now set the config TDP level 1 ratio as 0x0b (equivalent to 1100000KHz) in BIOS (not every system will let you adjust this). The turbo activation ratio will be set to one less than that, which will be 0x0a (So any request above 1000000KHz should result in turbo region assuming no thermal limits). Here _PPC will request max to 1100000KHz (which basically should still result in turbo as this is more than the turbo activation ratio up to max allowable turbo frequency), but actual calculation resulted in a max ceiling P-State which is 0x0a. So under any load condition, this driver will not request turbo P-States. This will be a huge performance hit. When config TDP feature is ON, if the _PPC points to a frequency above turbo activation ratio, the performance can still reach max turbo. In this case we don't need to treat this as the reduced frequency in set_policy callback. In this change when config TDP is active (by checking if the physical max non turbo ratio is more than the current max non turbo ratio), any request above current max non turbo is treated as full performance. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw : Minor cleanups ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-28 01:01:39 +02:00
Srinivas Pandruvada	9522a2ff9c	cpufreq: intel_pstate: Enforce _PPC limits Use ACPI _PPC notification to limit max P state driver will request. ACPI _PPC change notification is sent by BIOS to limit max P state in several cases: - Reduce impact of platform thermal condition - When Config TDP feature is used, a changed _PPC is sent to follow TDP change - Remote node managers in server want to control platform power via baseboard management controller (BMC) This change registers with ACPI processor performance lib so that _PPC changes are notified to cpufreq core, which in turns will result in call to .setpolicy() callback. Also the way _PSS table identifies a turbo frequency is not compatible to max turbo frequency in intel_pstate, so the very first entry in _PSS needs to be adjusted. This feature can be turned on by using kernel parameters: intel_pstate=support_acpi_ppc Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Minor cleanups ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-28 01:01:39 +02:00
Srinivas Pandruvada	1becf03545	cpufreq: intel_pstate: Fix processing for turbo activation ratio When the config TDP level is not nominal (level = 0), the MSR values for reading level 1 and level 2 ratios contain power in low 14 bits and actual ratio bits are at bits [23:16]. The current processing for level 1 and level 2 is wrong as there is no shift done to get actual ratio. Fixes: `6a35fc2d6c` (cpufreq: intel_pstate: get P1 from TAR when available) Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: 4.4+ <stable@vger.kernel.org> # 4.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-25 23:39:09 +02:00
Philippe Longepe	bdcaa23fb8	cpufreq: intel_pstate: Use average P-State instead of current P-State The result returned by pid_calc() is subtracted from current_pstate (which is the P-State requested during the last period) in order to obtain the target P-State for the current iteration. However, current_pstate may not reflect the real current P-State of the CPU. In particular, that P-State may be higher because of the frequency sharing per module. The theory is: - The load is the percentage of time spent in C0 and is related to the average P-State during the same period. - The last requested P-State can be completely different than the average P-State (because of frequency sharing or throttling). - The P-State shift computed by the pid_calc is based on the load computed at average P-State, so the shift must be relative to this average P-State. Using the average P-State instead of current P-State improves power without significant performance penalty in cases when a task migrates from one core to other core sharing frequency and voltage. Performance and power comparison with this patch on Cherry Trail platform using Android: Benchmark ?Perf ?Power FishTank 10.45% 3.1% SmartBench-Gaming -0.1% -10.4% SmartBench-Productivity -0.8% -10.4% CandyCrush n/a -17.4% AngryBirds n/a -5.9% videoPlayback n/a -13.9% audioPlayback n/a -4.9% IcyRocks-20-50 0.0% -38.4% iozone RR -0.16% -1.3% iozone RW 0.74% -1.3% Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-25 15:45:11 +02:00
Rafael J. Wysocki	1cbc99dfe5	Merge back cpufreq changes for v4.7.	2016-04-25 15:44:01 +02:00
Rafael J. Wysocki	ffb810563c	intel_pstate: Avoid getting stuck in high P-states when idle Jörg Otte reports that commit `a4675fbc4a` (cpufreq: intel_pstate: Replace timers with utilization update callbacks) caused the CPUs in his Haswell-based system to stay in the very high frequency region even if the system is completely idle. That turns out to be an existing problem in the intel_pstate driver's P-state selection algorithm for Core processors. Namely, all decisions made by that algorithm are based on the average frequency of the CPU between sampling events and on the P-state requested on the last invocation, so it may get stuck at a very hight frequency even if the utilization of the CPU is very low (in fact, it may get stuck in a inadequate P-state regardless of the CPU utilization). The only way to kick it out of that limbo is a sufficiently long idle period (3 times longer than the prescribed sampling interval), but if that doesn't happen often enough (eg. due to a timing change like after the above commit), the P-state of the CPU may be inadequate pretty much all the time. To address the most egregious manifestations of that issue, reset the core_busy value used to determine the next P-state to request if the utilization of the CPU, determined with the help of the MPERF feedback register and the TSC, is below 1%. Link: https://bugzilla.kernel.org/show_bug.cgi?id=115771 Reported-and-tested-by: Jörg Otte <jrg.otte@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-10 05:59:10 +02:00
Joe Perches	4836df173a	intel_pstate: Use pr_fmt Prefix the output using the more common kernel style. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-09 01:33:42 +02:00
Rafael J. Wysocki	22590efb98	intel_pstate: Avoid pointless FRAC_BITS shifts under div_fp() There are multiple places in intel_pstate where int_tofp() is applied to both arguments of div_fp(), but this is pointless, because int_tofp() simply shifts its argument to the left by FRAC_BITS which mathematically is equivalent to multuplication by 2^FRAC_BITS, so if this is done to both arguments of a division, the extra factors will cancel each other during that operation anyway. Drop the pointless int_tofp() applied to div_fp() arguments throughout the driver. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-09 01:25:58 +02:00
Rafael J. Wysocki	4b42fafc1c	Merge branch 'pm-cpufreq-sched' into pm-cpufreq	2016-04-09 01:08:02 +02:00
Srinivas Pandruvada	13ad7701f9	cpufreq: intel_pstate: Documenation for structures No code change. Only added kernel doc style comments for structures. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-05 03:39:05 +02:00
Srinivas Pandruvada	30a3915385	cpufreq: intel_pstate: fix inconsistency in setting policy limits When user sets performance policy using cpufreq interface, it is possible that because of policy->max limits, the actual performance is still limited. But the current implementation will silently switch the policy to powersave and start using powersave limits. If user modifies any limits using intel_pstate sysfs, this is actually changing powersave limits. The current implementation tracks limits under powersave and performance policy using two different variables. When policy->max is less than policy->cpuinfo.max_freq, only powersave limit variable is used. This fix causes the performance limits variable to be used always when the policy is performance. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-05 03:37:13 +02:00
Rafael J. Wysocki	0bed612be6	cpufreq: sched: Helpers to add and remove update_util hooks Replace the single helper for adding and removing cpufreq utilization update hooks, cpufreq_set_update_util_data(), with a pair of helpers, cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(), and modify the users of cpufreq_set_update_util_data() accordingly. With the new helpers, the code using them doesn't need to worry about the internals of struct update_util_data and in particular it doesn't need to worry about populating the func field in it properly upfront. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2016-04-02 01:08:43 +02:00
Rafael J. Wysocki	febce40feb	intel_pstate: Avoid extra invocation of intel_pstate_sample() The initialization of intel_pstate for a given CPU involves populating the fields of its struct cpudata that represent the previous sample, but currently that is done in a problematic way. Namely, intel_pstate_init_cpu() makes an extra call to intel_pstate_sample() so it reads the current register values that will be used to populate the "previous sample" record during the next invocation of intel_pstate_sample(). However, after commit `a4675fbc4a` (cpufreq: intel_pstate: Replace timers with utilization update callbacks) that doesn't work for last_sample_time, because the time value is passed to intel_pstate_sample() as an argument now. Passing 0 to it from intel_pstate_init_cpu() is problematic, because that causes cpu->last_sample_time == 0 to be visible in get_target_pstate_use_performance() (and hence the extra cpu->last_sample_time > 0 check in there) and effectively allows the first invocation of intel_pstate_sample() from intel_pstate_update_util() to happen immediately after the initialization which may lead to a significant "turn on" effect in the governor algorithm. To mitigate that issue, rework the initialization to avoid the extra intel_pstate_sample() call from intel_pstate_init_cpu(). Instead, make intel_pstate_sample() return false if it has been called with cpu->sample.time equal to zero, which will make intel_pstate_update_util() skip the sample in that case, and reset cpu->sample.time from intel_pstate_set_update_util_hook() to make the algorithm start properly every time the hook is set. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-04-02 01:06:21 +02:00
Rafael J. Wysocki	bb6ab52f2b	intel_pstate: Do not set utilization update hook too early The utilization update hook in the intel_pstate driver is set too early, as it only should be set after the policy has been fully initialized by the core. That may cause intel_pstate_update_util() to use incorrect data and put the CPUs into incorrect P-states as a result. To prevent that from happening, make intel_pstate_set_policy() set the utilization update hook instead of intel_pstate_init_cpu() so intel_pstate_update_util() only runs when all things have been initialized as appropriate. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-31 17:42:15 +02:00
Rafael J. Wysocki	fdfdb2b130	intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts After commit `a4675fbc4a` (cpufreq: intel_pstate: Replace timers with utilization update callbacks) wrmsrl_on_cpu() cannot be called in the intel_pstate_adjust_busy_pstate() path as that is executed with disabled interrupts. However, atom_set_pstate() called from there via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in smp_call_function_single(). The reason why wrmsrl_on_cpu() is used by atom_set_pstate() is because intel_pstate_set_pstate() calling it is also invoked during the initialization and cleanup of the driver and in those cases it is not guaranteed to be run on the CPU that is being updated. However, in the case when intel_pstate_set_pstate() is called by intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update the register safely. Moreover, intel_pstate_set_pstate() already contains code that only is executed if the function is called by intel_pstate_adjust_busy_pstate() and there is a special argument passed to it because of that. To fix the problem at hand, rearrange the code taking the above observations into account. First, replace the ->set() callback in struct pstate_funcs with a ->get_val() one that will return the value to be written to the IA32_PERF_CTL MSR without updating the register. Second, split intel_pstate_set_pstate() into two functions, intel_pstate_update_pstate() to be called by intel_pstate_adjust_busy_pstate() that will contain all of the intel_pstate_set_pstate() code which only needs to be executed in that case and will use wrmsrl() to update the MSR (after obtaining the value to write to it from the ->get_val() callback), and intel_pstate_set_min_pstate() to be invoked during the initialization and cleanup that will set the P-state to the minimum one and will update the MSR using wrmsrl_on_cpu(). Finally, move the code shared between intel_pstate_update_pstate() and intel_pstate_set_min_pstate() to a new static inline function intel_pstate_record_pstate() and make them both call it. Of course, that unifies the handling of the IA32_PERF_CTL MSR writes between Atom and Core. Fixes: `a4675fbc4a` (cpufreq: intel_pstate: Replace timers with utilization update callbacks) Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-20 00:37:09 +01:00
Rafael J. Wysocki	4fec7ad5f6	intel_pstate: Do not skip samples partially If the current value of MPERF or the current value of TSC is the same as the previous one, respectively, intel_pstate_sample() bails out early and skips the sample. However, intel_pstate_adjust_busy_pstate() is still called in that case which is not correct, so modify intel_pstate_sample() to return a bool value indicating whether or not the sample has been taken and use it to decide whether or not to call intel_pstate_adjust_busy_pstate(). While at it, remove redundant parentheses from the MPERF/TSC check in intel_pstate_sample(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-03-11 00:07:51 +01:00
Philippe Longepe	8fa520af50	intel_pstate: Remove freq calculation from intel_pstate_calc_busy() Use a helper function to compute the average pstate and call it only where it is needed (only when tracing or in intel_pstate_get). Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-11 00:04:58 +01:00
Philippe Longepe	7349ec0470	intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance() The cpu_load algorithm doesn't need to invoke intel_pstate_calc_busy(), so move that call from intel_pstate_sample() to get_target_pstate_use_performance(). Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-11 00:04:58 +01:00
Philippe Longepe	a158bed5dc	intel_pstate: Optimize calculation for max/min_perf_adj mul_fp(int_tofp(A), B) expands to: ((A << FRAC_BITS) * B) >> FRAC_BITS, so the same result can be obtained via simple multiplication A * B. Apply this observation to max_perf * limits->max_perf and max_perf * limits->min_perf in intel_pstate_get_min_max()." Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-11 00:04:58 +01:00
Philippe Longepe	b54a0dfd56	intel_pstate: Remove extra conversions in pid calculation pid->setpoint and pid->deadband can be initialized in fixed point, so we can avoid the int_tofp in pid_calc. Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-11 00:04:42 +01:00
Rafael J. Wysocki	a5acbfbd70	Merge branch 'pm-cpufreq-governor' into pm-cpufreq	2016-03-10 20:46:03 +01:00
Rafael J. Wysocki	08f511fd41	cpufreq: Reduce cpufreq_update_util() overhead a bit Use the observation that cpufreq_update_util() is only called by the scheduler with rq->lock held, so the callers of cpufreq_set_update_util_data() can use synchronize_sched() instead of synchronize_rcu() to wait for cpufreq_update_util() to complete. Moreover, if they are updated to do that, rcu_read_(un)lock() calls in cpufreq_update_util() might be replaced with rcu_read_(un)lock_sched(), respectively, but those aren't really necessary, because the scheduler calls that function from RCU-sched read-side critical sections already. In addition to that, if cpufreq_set_update_util_data() checks the func field in the struct update_util_data before setting the per-CPU pointer to it, the data->func check may be dropped from cpufreq_update_util() as well. Make the above changes to reduce the overhead from cpufreq_update_util() in the scheduler paths invoking it and to make the cleanup after removing its callbacks less heavy-weight somewhat. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2016-03-09 15:07:58 +01:00
Rafael J. Wysocki	a4675fbc4a	cpufreq: intel_pstate: Replace timers with utilization update callbacks Instead of using a per-CPU deferrable timer for utilization sampling and P-states adjustments, register a utilization update callback that will be invoked from the scheduler on utilization changes. The sampling rate is still the same as what was used for the deferrable timers, so the functional impact of this patch should not be significant. Based on an earlier patch from Srinivas Pandruvada. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>	2016-03-09 14:40:52 +01:00
Srinivas Pandruvada	f05c966585	cpufreq: intel_pstate: disable HWP notifications Disable HWP Interrupt notification before enabling HWP. Since we don't have HWP interrupt handling for possible performance interrupts, there is not much use of enabling HWP interrupts. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-02-26 22:15:38 +01:00
Srinivas Pandruvada	7791e4aa59	cpufreq: intel_pstate: Enable HWP by default If the processor supports HWP, enable it by default without checking for the cpu model. This will allow to enable HWP in all supported processors without driver change. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-02-26 22:15:38 +01:00
Viresh Kumar	41cfd64cf4	intel_pstate: Update frequencies of policy->cpus only from ->set_policy() The intel-pstate driver is using intel_pstate_hwp_set() from two separate paths, i.e. ->set_policy() callback and sysfs update path for the files present in /sys/devices/system/cpu/intel_pstate/ directory. While an update to the sysfs path applies to all the CPUs being managed by the driver (which essentially means all the online CPUs), the update via the ->set_policy() callback applies to a smaller group of CPUs managed by the policy for which ->set_policy() is called. And so, intel_pstate_hwp_set() should update frequencies of only the CPUs that are part of policy->cpus mask, while it is called from ->set_policy() callback. In order to do that, add a parameter (cpumask) to intel_pstate_hwp_set() and apply the frequency changes only to the concerned CPUs. For ->set_policy() path, we are only concerned about policy->cpus, and so policy->rwsem lock taken by the core prior to calling ->set_policy() is enough to take care of any races. The larger lock acquired by get_online_cpus() is required only for the updates to sysfs files. Add another routine, intel_pstate_hwp_set_online_cpus(), and call it from the sysfs update paths. This also fixes a lockdep reported recently, where policy->rwsem and get_online_cpus() could have been acquired in any order causing an ABBA deadlock. The sequence of events leading to that was: intel_pstate_init(...) ...cpufreq_online(...) down_write(&policy->rwsem); // Locks policy->rwsem ... cpufreq_init_policy(policy); ...intel_pstate_hwp_set(); get_online_cpus(); // Temporarily locks cpu_hotplug.lock ... up_write(&policy->rwsem); pm_suspend(...) ...disable_nonboot_cpus() _cpu_down() cpu_hotplug_begin(); // Locks cpu_hotplug.lock __cpu_notify(CPU_DOWN_PREPARE, ...); ...cpufreq_offline_prepare(); down_write(&policy->rwsem); // Locks policy->rwsem Reported-and-tested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-02-23 01:10:44 +01:00
Rafael J. Wysocki	4157c2fc84	Merge back earlier cpufreq material for v4.5.	2015-12-21 03:15:15 +01:00
Prarit Bhargava	88b7b7c0c2	cpufreq: intel_pstate: Minor cleanup for FRAC_BITS `785ee27` ("cpufreq: intel_pstate: Fix limits->max_perf rounding error") hardcodes the value of FRAC_BITS. This patch fixes that minor issue. Fixes: `785ee27881` (cpufreq: intel_pstate: Fix limits->max_perf rounding error) Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-12-12 02:28:19 +01:00
Philippe Longepe	63d1d656a5	cpufreq: intel_pstate: Account for IO wait time In cases where we have many IOs, the global load becomes low and the load algorithm will decrease the requested P-State. Because of that, the IOs overheads will increase and impact the IO performances. To improve IO bound work, we can count the io-wait time as busy time in calculating CPU busy. This change uses get_cpu_iowait_time_us() to obtain the IO wait time value and converts time into number of cycles spent waiting on IO at the TSC rate. At the moment, this trick is only used for Atom. Signed-off-by: Philippe Longepe <philippe.longepe@intel.com> Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-12-10 01:17:40 +01:00
Philippe Longepe	e70eed2b64	cpufreq: intel_pstate: Account for non C0 time The current function to calculate cpu utilization uses the average P-state ratio (APerf/Mperf) scaled by the ratio of the current P-state to the max available non-turbo one. This leads to an overestimation of utilization which causes higher-performance P-states to be selected more often and that leads to increased energy consumption. This is a problem for low-power systems, so it is better to use a different utilization calculation algorithm for them. Namely, the Percent Busy value (or load) can be estimated as the ratio of the MPERF counter that runs at a constant rate only during active periods (C0) to the time stamp counter (TSC) that also runs (at the same rate) during idle. That is: Percent Busy = 100 * (delta_mperf / delta_tsc) Use this algorithm for platforms with SoCs based on the Airmont and Silvermont Atom cores. Signed-off-by: Philippe Longepe <philippe.longepe@intel.com> Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-12-10 01:17:40 +01:00
Philippe Longepe	157386b6fc	cpufreq: intel_pstate: Configurable algorithm to get target pstate Target systems using different cpus have different power and performance requirements. They may use different algorithms to get the next P-state based on their power or performance preference. For example, power-constrained systems may not want to use high-performance P-states as aggressively as a full-size desktop or a server platform. A server platform may want to run close to the max to achieve better performance, while laptop-like systems may prefer sacrificing performance for longer battery lifes. For the above reasons, modify intel_pstate to allow the target P-state selection algorithm to be depend on the CPU ID. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Philippe Longepe <philippe.longepe@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-12-10 01:17:40 +01:00
Alexandra Yates	584ee3dcb1	intel_pstate: Fix "performance" mode behavior with HWP enabled If hardware-driven P-state selection (HWP) is enabled, the "performance" mode of intel_pstate should only allow the processor to use the highest-performance P-state available. That is not the case currently, so make it actually happen. Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Alexandra Yates <alexandra.yates@linux.intel.com> [ rjw: Subject and changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-25 23:37:44 +01:00
Prarit Bhargava	785ee27881	cpufreq: intel_pstate: Fix limits->max_perf rounding error A rounding error was found in the calculation of limits->max_perf in intel_pstate_set_policy(), which is used to calculate the max and min pstate values in intel_pstate_get_min_max(). In that code, limits->max_perf is truncated to 2 hex digits such that, for example, 0x169 was incorrectly calculated to 0x16 instead of 0x17. This resulted in the pstate being set one level too low. This patch rounds the value of limits->max_perf up instead of down so that the correct max pstate can be reached. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-23 23:15:34 +01:00
Prarit Bhargava	8478f53946	cpufreq: intel_pstate: Fix limits->max_policy_pct rounding error I have a Intel (6,63) processor with a "marketing" frequency (from /proc/cpuinfo) of 2100MHz, and a max turbo frequency of 2600MHz. I can execute cpupower frequency-set -g powersave --min 1200MHz --max 2100MHz and the max_freq_pct is set to 80. When adding load to the system I noticed that the cpu frequency only reached 2000MHZ and not 2100MHz as expected. This is because limits->max_policy_pct is calculated as 2100 * 100 /2600 = 80.7 and is rounded down to 80 when it should be rounded up to 81. This patch adds a DIV_ROUND_UP() which will return the correct value. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-23 23:14:10 +01:00
Philippe Longepe	1421df63c3	cpufreq: intel_pstate: Add separate support for Airmont cores There are two flavors of Atom cores to be supported by intel_pstate, Silvermont and Airmont, so make the driver distinguish between them by adding separate frequency tables. Separate the CPU defaults params for each of them and match the CPU IDs against them as appropriate. Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw: Subject and changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-19 00:21:46 +01:00
Philippe Longepe	938d21a2a6	cpufreq: intel_pstate: Replace BYT with ATOM Rename symbol and function names starting with "BYT" or "byt" to start with "ATOM" or "atom", respectively, so as to make it clear that they may apply to Atom in general and not just to Baytrail (the goal is to support several Atoms architectures eventually). This should not lead to any functional changes. Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com> Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com> Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> [ rjw : Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-19 00:21:46 +01:00
Rafael J. Wysocki	6ee11e413c	Revert "cpufreq: intel_pstate: Use ACPI perf configuration" Revert commit `37afb00032` (cpufreq: intel_pstate: Use ACPI perf configuration) that is reported to cause a regression to happen on a system where invalid data are returned by the ACPI _PSS object. Since that commit makes assumptions regarding the _PSS output correctness that may turn out to be overly optimistic in general, there is a concern that it may introduce regression on more systems, so it's better to revert it now and we'll revisit the underlying issue in the next cycle with a more robust solution. Conflicts: drivers/cpufreq/intel_pstate.c Fixes: `37afb00032` (cpufreq: intel_pstate: Use ACPI perf configuration) Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-19 00:20:42 +01:00
Rafael J. Wysocki	799281a3c4	Revert "cpufreq: intel_pstate: Avoid calculation for max/min" Revert commit `4ef4514870` (cpufreq: intel_pstate: Avoid calculation for max/min) as it depends on commit `37afb00032` (cpufreq: intel_pstate: Use ACPI perf configuration) that causes problems to happen and needs to be reverted. Conflicts: drivers/cpufreq/intel_pstate.c Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-18 23:29:56 +01:00
Prarit Bhargava	539342f60b	intel_pstate: decrease number of "HWP enabled" messages When booting an HWP enabled system the kernel displays one "HWP enabled" message for each cpu. The messages are superfluous since HWP is globally enabled across all CPUs. This patch also adds an informational message when HWP is disabled via intel_pstate=no_hwp. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-11-02 02:01:18 +01:00
Prarit Bhargava	51443fbf3d	cpufreq: intel_pstate: Fix intel_pstate powersave min_perf_pct value On systems that initialize the intel_pstate driver with the performance governor, and then switch to the powersave governor will not transition to lower cpu frequencies until /sys/devices/system/cpu/intel_pstate/min_perf_pct is set to a low value. The behavior of governor switching changed after commit `a04759924e` ("[cpufreq] intel_pstate: honor user space min_perf_pct override on resume"). The commit introduced tracking of performance percentage changes via sysfs in order to restore userspace changes during suspend/resume. The problem occurs because the global values of the newly introduced max_sysfs_pct and min_sysfs_pct are not lowered on the governor change and this causes the powersave governor to inherit the performance governor's settings. A simple change would have been to reset max_sysfs_pct to 100 and min_sysfs_pct to 0 on a governor change, which fixes the problem with governor switching. However, since we cannot break userspace[1] the fix is now to give each governor its own limits storage area so that governor specific changes are tracked. I successfully tested this by booting with both the performance governor and the powersave governor by default, and switching between the two governors (while monitoring /sys/devices/system/cpu/intel_pstate/ values, and looking at the output of cpupower frequency-info). Suspend/Resume testing was performed by Doug Smythies. [1] Systems which suspend/resume using the unmaintained pm-utils package will always transition to the performance governor before the suspend and after the resume. This means a system using the powersave governor will go from powersave to performance, then suspend/resume, performance to powersave. The simple change during governor changes would have been overwritten when the governor changed before and after the suspend/resume. I have submitted https://bugzilla.redhat.com/show_bug.cgi?id=1271225 against Fedora to remove the 94cpufreq file that causes the problem. It should be noted that pm-utils is obsoleted with newer versions of systemd. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-10-16 22:35:11 +02:00
Rafael J. Wysocki	7855e10294	Merge back earlier cpufreq material for v4.4.	2015-10-16 22:12:02 +02:00
Srinivas Pandruvada	8e601a9f97	cpufreq: intel_pstate: Fix divide by zero on Knights Landing (KNL) This is a workaround for KNL platform, where in some cases MPERF counter will not have updated value before next read of MSR_IA32_MPERF. In this case divide by zero will occur. This change ignores current sample for busy calculation in this case. Fixes: `b34ef932d7` (intel_pstate: Knights Landing support) Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Cc: 4.1+ <stable@vger.kernel.org> # 4.1+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-10-15 22:46:33 +02:00
Srinivas Pandruvada	4ef4514870	cpufreq: intel_pstate: Avoid calculation for max/min When requested from cpufreq to set policy, look into _pss and get control values, instead of using max/min perf calculations. These calculation misses next control state in boundary conditions. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-10-15 01:53:19 +02:00
Srinivas Pandruvada	37afb00032	cpufreq: intel_pstate: Use ACPI perf configuration Use ACPI _PSS to limit the Intel P State turbo, max and min ratios. This driver uses acpi processor perf lib calls to register performance. The following logic is used to adjust Intel P state driver limits: - If there is no turbo entry in _PSS, then disable Intel P state turbo and limit to non turbo max - If the non turbo max ratio is more than _PSS max non turbo value, then set the max non turbo ratio to _PSS non turbo max - If the min ratio is less than _PSS min then change the min ratio matching _PSS min - Scale the _PSS turbo frequency to max turbo frequency based on control value. This feature can be disabled by using kernel parameters: intel_pstate=no_acpi Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-10-15 01:53:18 +02:00
Srinivas Pandruvada	3bcc6fa971	cpufreq: intel-pstate: Use separate max pstate for scaling Systems with configurable TDP have multiple max non turbo p state. Intel P state uses max non turbo P state for scaling. But using the real max non turbo p state causes underestimation of next P state. So using the physical max non turbo P state as before for scaling. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-10-15 01:53:18 +02:00
Srinivas Pandruvada	6a35fc2d6c	cpufreq: intel_pstate: get P1 from TAR when available After Ivybridge, the max non turbo ratio obtained from platform info msr is not always guaranteed P1 on client platforms. The max non turbo activation ratio (TAR), determines the max for the current level of TDP. The ratio in platform info is physical max. The TAR MSR can be locked, so updating this value is not possible on all platforms. This change gets this ratio from MSR TURBO_ACTIVATION_RATIO if available, but also do some sanity checking to make sure that this value is correct. The sanity check involves reading the TDP ratio for the current tdp control value when platform has configurable TDP present and matching TAC with this. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2015-10-15 01:53:18 +02:00

1 2 3 4 5 ...

369 Commits