path: root/drivers/cpufreq/intel_pstate.c
Commit message  Author  Age  Files  Lines
* cpufreq: Use struct kobj_attribute instead of struct global_attr  Viresh Kumar  2019-03-13  1  -7/+7
| commit 625c85a62cb7d3c79f6e16de3cfa972033658250 upstream.
| The cpufreq_global_kobject is created using kobject_create_and_add() helper, which assigns the kobj_type as dynamic_kobj_ktype and show/store routines are set to kobj_attr_show() and kobj_attr_store().
| These routines pass struct kobj_attribute as an argument to the show/store callbacks. But all the cpufreq files created using the cpufreq_global_kobject expect the argument to be of type struct attribute. Things work fine currently as no one accesses the "attr" argument. We may not see issues even if the argument is used, as struct kobj_attribute has struct attribute as its first element and so they will both get same address.
| But this is logically incorrect and we should rather use struct kobj_attribute instead of struct global_attr in the cpufreq core and drivers and the show/store callbacks should take struct kobj_attribute as argument instead.
| This bug is caught using CFI CLANG builds in android kernel which catches mismatch in function prototypes for such callbacks.
| Reported-by: Donghee Han <dh.han@samsung.com>
| Reported-by: Sangkyu Kim <skwith.kim@samsung.com>
| Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
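The first-member argument in the message above can be illustrated in plain C. This is only a sketch: the two structs below are simplified, hypothetical stand-ins for the kernel's struct attribute and struct kobj_attribute (the real callbacks take more arguments), and to_kobj_attr, show_min_perf and dispatch_show are names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins (assumptions), not the kernel definitions. */
struct attribute {
    const char *name;
};

struct kobj_attribute {
    struct attribute attr;  /* first member: shares the address of the whole struct */
    int (*show)(struct kobj_attribute *attr, char *buf);
};

/* Recover the enclosing kobj_attribute from a pointer to its embedded attribute. */
#define to_kobj_attr(p) \
    ((struct kobj_attribute *)((char *)(p) - offsetof(struct kobj_attribute, attr)))

/* A callback with the prototype the dispatcher really uses. */
static int show_min_perf(struct kobj_attribute *attr, char *buf)
{
    (void)attr;
    buf[0] = '0';
    buf[1] = '\0';
    return 1;
}

/* Dispatcher in the spirit of kobj_attr_show(): it is handed a struct attribute
 * pointer and calls the callback with the enclosing struct kobj_attribute. */
static int dispatch_show(struct attribute *attr, char *buf)
{
    struct kobj_attribute *kattr = to_kobj_attr(attr);

    assert(kattr->show != NULL);
    return kattr->show(kattr, buf);
}
```

Because attr sits at offset 0, a callback declared with the wrong pointer type would still receive a usable address, which is why the mismatch went unnoticed until a prototype-checking build flagged it.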
* cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()  Rafael J. Wysocki  2017-10-08  1  -0/+1
| [ Upstream commit 6e7408acd04d06c04981c0c0fb5a2462b16fae4f ]
| Fix the debugfs interface for PID tuning to actually update pid_params.sample_rate_ns on PID parameters updates, as changing pid_params.sample_rate_ms via debugfs has no effect now.
| Fixes: a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
| Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cpufreq: intel_pstate: Disable energy efficiency optimization  Srinivas Pandruvada  2017-02-14  1  -0/+30
| commit 6e978b22efa1db9f6e71b24440b5f1d93e968ee3 upstream.
| Some Kabylake desktop processors may not reach max turbo when running in HWP mode, even if running under sustained 100% utilization.
| This occurs when the HWP.EPP (Energy Performance Preference) is set to "balance_power" (0x80) -- the default on most systems.
| It occurs because the platform BIOS may erroneously enable an energy-efficiency setting -- MSR_IA32_POWER_CTL BIT-EE, which is not recommended to be enabled on this SKU.
| On the failing systems, this BIOS issue was not discovered when the desktop motherboard was tested with Windows, because the BIOS also neglects to provide the ACPI/CPPC table that Windows requires to enable HWP, and so Windows runs in legacy P-state mode, where this setting has no effect.
| Linux' intel_pstate driver does not require ACPI/CPPC to enable HWP, and so it runs in HWP mode, exposing this incorrect BIOS configuration.
| There are several ways to address this problem.
| First, Linux can also run in legacy P-state mode on this system. As intel_pstate is how Linux enables HWP, booting with "intel_pstate=disable" will run in acpi-cpufreq/ondemand legacy p-state mode.
| Or second, the "performance" governor can be used with intel_pstate, which will modify HWP.EPP to 0.
| Or third, starting in 4.10, the /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference attribute can be updated from "balance_power" to "performance".
| Or fourth, apply this patch, which fixes the erroneous setting of MSR_IA32_POWER_CTL BIT_EE on this model, allowing the default configuration to function as designed.
| Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| Reviewed-by: Len Brown <len.brown@intel.com>
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cpufreq: intel_pstate: Always set max P-state in performance mode  Rafael J. Wysocki  2016-10-24  1  -3/+8
| The only times at which intel_pstate checks the policy set for a given CPU are the initialization of that CPU and updates of its policy settings from cpufreq when intel_pstate_set_policy() is invoked.
| That is insufficient, however, because intel_pstate uses the same P-state selection function for all CPUs regardless of the policy setting for each of them and the P-state limits are shared between them. Thus if the policy is set to "performance" for a particular CPU, it may not behave as expected if the cpufreq settings are changed subsequently for another CPU.
| That can be easily demonstrated by writing "performance" to scaling_governor for all CPUs and then switching it to "powersave" for one of them, in which case all of the CPUs will behave as though their scaling_governor were all "powersave" (even though the policy still appears to be "performance" for the remaining CPUs).
| Fix this problem by modifying intel_pstate_adjust_busy_pstate() to always set the P-state to the maximum allowed by the current limits for all CPUs whose policy is set to "performance".
| Note that it still is recommended to always change the policy setting in the same way for all CPUs even with this fix applied to avoid confusion.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpufreq: intel_pstate: Set P-state upfront in performance mode  Rafael J. Wysocki  2016-10-21  1  -4/+25
| After commit a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks) the cpufreq governor callbacks may not be invoked on NOHZ_FULL CPUs and, in particular, switching to the "performance" policy via sysfs may not have any effect on them.
| That is a problem, because it usually is desirable to squeeze the last bit of performance out of those CPUs, so work around it by setting the maximum P-state (within the limits) in intel_pstate_set_policy() upfront when the policy is CPUFREQ_POLICY_PERFORMANCE.
| Fixes: a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
* cpufreq: intel_pstate: Fix struct pstate_adjust_policy kerneldoc  Rafael J. Wysocki  2016-10-12  1  -1/+1
| It looks like the name of struct pstate_adjust_policy was updated without updating its kerneldoc comment accordingly, so fix that mistake.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpufreq: intel_pstate: Proportional algorithm for Atom  Rafael J. Wysocki  2016-10-12  1  -1/+21
| The PID algorithm used by the intel_pstate driver tends to drive performance to the minimum for workloads with utilization below the setpoint, which is undesirable, so replace it with a modified "proportional" algorithm on Atom.
| The new algorithm will set the new P-state to be 1.25 times the available maximum times the (frequency-invariant) utilization during the previous sampling period except when the target P-state computed this way is lower than the average P-state during the previous sampling period. In the latter case, it will increase the target by 50% of the difference between it and the average P-state to prevent performance from dropping down too fast in some cases.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
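The rule described in the message above can be written out as a small integer computation. The following is a rough sketch only, not the driver's code: atom_target_pstate is a hypothetical name, utilization is assumed to be given as a percentage, and clamping the result to the allowed P-state range is left out.

```c
#include <assert.h>

/* Sketch of the "proportional" rule: target = 1.25 * max * utilization,
 * softened when it falls below the previous period's average P-state.
 * util_pct is the utilization in percent (an assumption of this sketch). */
static int atom_target_pstate(int max_pstate, int util_pct, int avg_pstate)
{
    int target;

    assert(max_pstate > 0 && util_pct >= 0 && util_pct <= 100);

    /* 1.25 * max_pstate * (util_pct / 100), in integer arithmetic. */
    target = max_pstate * 125 * util_pct / 10000;

    /* Below the recent average: move only halfway down towards the target. */
    if (target < avg_pstate)
        target += (avg_pstate - target) / 2;

    return target;
}
```

For example, with a maximum P-state of 20, 20% utilization and an average P-state of 15, the raw target is 5 and the softened result is 10.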
* cpufreq: intel_pstate: Clarify comment in get_target_pstate_use_performance()  Rafael J. Wysocki  2016-10-09  1  -4/+5
| Make the comment explaining the meaning of the perf_scaled variable in get_target_pstate_use_performance() more straightforward.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpufreq: intel_pstate: Fix unsafe HWP MSR access  Srinivas Pandruvada  2016-10-09  1  -5/+5
| It is a requirement that MSR_PM_ENABLE be set to 0x01 before reading MSR_HWP_CAPABILITIES on a given CPU. If cpufreq init() is scheduled on a CPU which is not the same as policy->cpu, or migrates to a different CPU before calling the MSR read for MSR_HWP_CAPABILITIES, it is possible that MSR_PM_ENABLE was not set to 0x01 on that CPU. This will cause a GP fault. So, like other places in this path, rdmsrl_on_cpu should be used instead of rdmsrl.
| Moreover, the scope of MSR_HWP_CAPABILITIES is per thread, so it should be read from the same CPU for which MSR_HWP_REQUEST is getting set.
| dmesg dump or warning:
|   [ 22.014488] WARNING: CPU: 139 PID: 1 at arch/x86/mm/extable.c:50 ex_handler_rdmsr_unsafe+0x68/0x70
|   [ 22.014492] unchecked MSR access error: RDMSR from 0x771
|   [ 22.014493] Modules linked in:
|   [ 22.014507] CPU: 139 PID: 1 Comm: swapper/0 Not tainted 4.7.5+ #1
|   ...
|   ...
|   [ 22.014516] Call Trace:
|   [ 22.014542] [<ffffffff813d7dd1>] dump_stack+0x63/0x82
|   [ 22.014558] [<ffffffff8107bc8b>] __warn+0xcb/0xf0
|   [ 22.014561] [<ffffffff8107bcff>] warn_slowpath_fmt+0x4f/0x60
|   [ 22.014563] [<ffffffff810676f8>] ex_handler_rdmsr_unsafe+0x68/0x70
|   [ 22.014564] [<ffffffff810677d9>] fixup_exception+0x39/0x50
|   [ 22.014604] [<ffffffff8102e400>] do_general_protection+0x80/0x150
|   [ 22.014610] [<ffffffff817f9ec8>] general_protection+0x28/0x30
|   [ 22.014635] [<ffffffff81687940>] ? get_target_pstate_use_performance+0xb0/0xb0
|   [ 22.014642] [<ffffffff810600c7>] ? native_read_msr+0x7/0x40
|   [ 22.014657] [<ffffffff81688123>] intel_pstate_hwp_set+0x23/0x130
|   [ 22.014660] [<ffffffff81688406>] intel_pstate_set_policy+0x1b6/0x340
|   [ 22.014662] [<ffffffff816829bb>] cpufreq_set_policy+0xeb/0x2c0
|   [ 22.014664] [<ffffffff81682f39>] cpufreq_init_policy+0x79/0xe0
|   [ 22.014666] [<ffffffff81682cb0>] ? cpufreq_update_policy+0x120/0x120
|   [ 22.014669] [<ffffffff816833a6>] cpufreq_online+0x406/0x820
|   [ 22.014671] [<ffffffff8168381f>] cpufreq_add_dev+0x5f/0x90
|   [ 22.014717] [<ffffffff81530ac8>] subsys_interface_register+0xb8/0x100
|   [ 22.014719] [<ffffffff816821bc>] cpufreq_register_driver+0x14c/0x210
|   [ 22.014749] [<ffffffff81fe1d90>] intel_pstate_init+0x39d/0x4d5
|   [ 22.014751] [<ffffffff81fe13f2>] ? cpufreq_gov_dbs_init+0x12/0x12
| Cc: 4.3+ <stable@vger.kernel.org> # 4.3+
| Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* Merge branch 'pm-cpufreq-sched' into pm-cpufreq  Rafael J. Wysocki  2016-10-02  1  -29/+34
|\
| * cpufreq: intel_pstate: Add io_boost trace  Srinivas Pandruvada  2016-09-16  1  -1/+2
| | Add io_boost percent to current pstate_sample tracepoint.
| | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| * cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm  Rafael J. Wysocki  2016-09-14  1  -27/+31
| | Modify the P-state selection algorithm for Atom processors to use the new SCHED_CPUFREQ_IOWAIT flag instead of the questionable get_cpu_iowait_time_us() function.
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| * cpufreq / sched: Pass flags to cpufreq_update_util()  Rafael J. Wysocki  2016-08-16  1  -1/+1
| | It is useful to know the reason why cpufreq_update_util() has just been called and that can be passed as flags to cpufreq_update_util() and to the ->func() callback in struct update_util_data. However, doing that in addition to passing the util and max arguments they already take would be clumsy, so avoid it.
| | Instead, use the observation that the schedutil governor is part of the scheduler proper, so it can access scheduler data directly. This allows the util and max arguments of cpufreq_update_util() and the ->func() callback in struct update_util_data to be replaced with a flags one, but schedutil has to be modified to follow.
| | Thus make the schedutil governor obtain the CFS utilization information from the scheduler and use the "RT" and "DL" flags instead of the special utilization value of ULONG_MAX to track updates from the RT and DL sched classes. Make it non-modular too to avoid having to export scheduler variables to modules at large.
| | Next, update all of the other users of cpufreq_update_util() and the ->func() callback in struct update_util_data accordingly.
| | Suggested-by: Peter Zijlstra <peterz@infradead.org>
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
| | Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
* | intel_pstate: constify local structures  Julia Lawall  2016-09-13  1  -4/+4
|/
| For structure types defined in the same file or local header files, find top-level static structure declarations that have the following properties:
|   1. Never reassigned.
|   2. Address never taken
|   3. Not passed to a top-level macro call
|   4. No pointer or array-typed field passed to a function or stored in a variable.
| Declare structures having all of these properties as const.
| Done using Coccinelle. Based on a suggestion by Joe Perches <joe@perches.com>.
| Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
| Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*---. Merge branches 'pm-sleep', 'pm-cpufreq', 'pm-core' and 'pm-opp'  Rafael J. Wysocki  2016-08-05  1  -0/+2
|\ \ \
| | | | * pm-sleep:
| | | |   x86/power/64: Do not refer to __PAGE_OFFSET from assembly code
| | | | * pm-cpufreq:
| | | |   cpufreq: Do not default-yes CPU_FREQ_STAT
| | | |   cpufreq: intel_pstate: Add more out-of-band IDs
| | | | * pm-core:
| | | |   PM-wakeup: Delete unnecessary checks before three function calls
| | | | * pm-opp:
| | | |   PM / OPP: optimize dev_pm_opp_set_rate() performance a bit
| | * | cpufreq: intel_pstate: Add more out-of-band IDs  Srinivas Pandruvada  2016-07-28  1  -0/+2
| | | | Add Skylake-X and Broadwell-X IDs for out-of-band (OOB) control of P-States. For these processors, if MSR_MISC_PWR_MGMT BIT(8) == 1, then the Intel P-State driver should exit as the OS can't control P-States.
| | | | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | | [ rjw : Subject/changelog ]
| | | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | | Merge branch 'pm-cpu'  Rafael J. Wysocki  2016-07-25  1  -2/+2
|\ \ \ \
| |_|/ /
|/| | |
| | | | * pm-cpu:
| | | |   x86: remove duplicate turbo ratio limit MSRs
| | | |   tools/power turbostat: Replace MSR_NHM_TURBO_RATIO_LIMIT
| | | |   cpufreq: intel_pstate: Replace MSR_NHM_TURBO_RATIO_LIMIT
| * | | cpufreq: intel_pstate: Replace MSR_NHM_TURBO_RATIO_LIMIT  Srinivas Pandruvada  2016-07-07  1  -2/+2
| | | | Replace MSR_NHM_TURBO_RATIO_LIMIT with MSR_TURBO_RATIO_LIMIT.
| | | | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | | Acked-by: Thomas Gleixner <tglx@linutronix.de>
| | | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | | cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT  Srinivas Pandruvada  2016-07-21  1  -1/+2
| | | | The MSR MSR_HWP_INTERRUPT is valid only when CPUID.06H:EAX[8] = 1, so check for feature before accessing this MSR.
| | | | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
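The feature check described above amounts to testing one bit before touching the register. A minimal sketch, assuming the caller already has the EAX value returned by CPUID leaf 6; the names below are hypothetical, and fake_hwp_interrupt_msr merely stands in for a real MSR write so the example stays runnable in user space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.06H:EAX bit 8, the bit named in the commit message. */
#define CPUID6_EAX_HWP_INTERRUPT_BIT (1u << 8)
static_assert(CPUID6_EAX_HWP_INTERRUPT_BIT == 0x100u, "bit 8");

/* Stand-in for the MSR contents (hypothetical; no real hardware access). */
static uint64_t fake_hwp_interrupt_msr = 0xdeadbeefu;

static bool hwp_interrupt_supported(uint32_t cpuid6_eax)
{
    return (cpuid6_eax & CPUID6_EAX_HWP_INTERRUPT_BIT) != 0;
}

/* Clear the interrupt MSR only when the CPU reports it as valid.
 * Returns true if the (simulated) write was performed. */
static bool disable_hwp_interrupts(uint32_t cpuid6_eax)
{
    if (!hwp_interrupt_supported(cpuid6_eax))
        return false;

    fake_hwp_interrupt_msr = 0;
    return true;
}
```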
* | | | intel_pstate: Update cpu_frequency tracepoint every time  Rafael J. Wysocki  2016-07-21  1  -8/+4
| | | | Currently, intel_pstate only updates the cpu_frequency tracepoint if the new P-state to set is different from the current one, but that causes powertop to report 100% idle on a 100% loaded system sometimes.
| | | | Prevent that from happening by updating the cpu_frequency tracepoint every time intel_pstate_update_pstate() is called.
| | | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | | | Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
* | | | cpufreq: intel_pstate: clean remnant struct element  Carsten Emde  2016-07-21  1  -2/+0
| | | | When I was working with the Intel P state driver I came across a remnant struct element that is no longer needed after the function intel_pstate_calc_freq() was retired.
| | | | Signed-off-by: Carsten Emde <C.Emde@osadl.org>
| | | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | | intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()  Jan Kiszka  2016-07-11  1  -1/+1
|/ / /
| | | If MSR_CONFIG_TDP_CONTROL is locked, we currently try to address some MSR 0x80000648 or so. Mask out the relevant level bits 0 and 1.
| | | Found while running over the Jailhouse hypervisor which became upset about this strange MSR index.
| | | Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
| | | Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
| | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
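The addressing problem described above can be reproduced with plain integer arithmetic. This is a sketch under assumptions: the base index 0x648 and a lock flag in bit 31 are inferred from the "0x80000648" address quoted in the message, and the helper names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_CONFIG_TDP_NOMINAL 0x648u       /* base index (assumed) */
#define TDP_CTRL_LOCK          (1ull << 31) /* lock flag position (assumed) */
#define TDP_CTRL_LEVEL_MASK    0x3ull       /* level bits 0 and 1 */

/* Buggy form: the whole control value, lock bit included, is added to the base. */
static uint32_t tdp_msr_unmasked(uint64_t tdp_ctrl)
{
    return (uint32_t)(MSR_CONFIG_TDP_NOMINAL + tdp_ctrl);
}

/* Fixed form: only the two level bits select the MSR. */
static uint32_t tdp_msr_masked(uint64_t tdp_ctrl)
{
    uint32_t msr = MSR_CONFIG_TDP_NOMINAL + (uint32_t)(tdp_ctrl & TDP_CTRL_LEVEL_MASK);

    assert(msr <= MSR_CONFIG_TDP_NOMINAL + 3u);
    return msr;
}
```

With the lock bit set and level 0 selected, the unmasked form yields 0x80000648 while the masked form yields 0x648.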
* | | Merge back earlier cpufreq material for v4.8.  Rafael J. Wysocki  2016-07-04  1  -33/+55
|\ \ \
| |_|/
|/| |
| * | intel_pstate: Declare pid_params/pstate_funcs/hwp_active __read_mostly  Jisheng Zhang  2016-06-28  1  -3/+3
| | | pid_params is written once by copy_pid_params() during initialization, and thereafter is mostly read by the hot path intel_pstate_update_util(). Reads of pid_params became more frequent after commit a4675fbc4a7a ("cpufreq: intel_pstate: Replace timers with utilization update callbacks").
| | | pstate_funcs is written once by copy_cpu_funcs() during initialization, and thereafter is mostly read by the hot path intel_pstate_update_util().
| | | hwp_active is written to once during initialization and thereafter is mostly read by the hot path intel_pstate_update_util().
| | | The fact that they are mostly read and not written to makes them candidates for __read_mostly declarations.
| | | Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
| | | Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
| | | Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| * | intel_pstate: add __init/__initdata marker to some functions/variables  Jisheng Zhang  2016-06-28  1  -9/+9
| | | These functions/variables are not needed after booting, so mark them as __init or __initdata.
| | | Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
| | | Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
| | | Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| * | intel_pstate: Fix incorrect placement of __initdata  Jisheng Zhang  2016-06-28  1  -3/+3
| | | __initdata should be placed between the variable name and the equal sign (if there is one) for the variable to be placed in the intended section.
| | | Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
| | | Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
| | | Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| * | Merge back earlier cpufreq changes for v4.8.  Rafael J. Wysocki  2016-06-20  1  -18/+40
| |\ \
| | |/
| |/|
| | * cpufreq: intel_pstate: Broxton support  Srinivas Pandruvada  2016-06-13  1  -0/+21
| | | Add Broxton CPU model number. Broxton requires core_params to get performance limits via MSRs, but it is an Atom platform, which requires a more power-optimized algorithm. So the P-state selection will use an algorithm similar to that of other Atom platforms.
| | | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | * Merge branch 'x86/cpu' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into x86/cpu  Rafael J. Wysocki  2016-06-13  1  -18/+19
| | |\
| | | | Pull recent changes related to x86 CPU model representations from tip.
| | | * x86/cpufreq: Use Intel family name macros for the intel_pstate cpufreq driver  Dave Hansen  2016-06-08  1  -18/+19
| | | | Another straightforward replacement of magic numbers.
| | | | Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
| | | | Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
| | | | Cc: Andy Lutomirski <luto@amacapital.net>
| | | | Cc: Borislav Petkov <bp@alien8.de>
| | | | Cc: Brian Gerst <brgerst@gmail.com>
| | | | Cc: Dave Hansen <dave@sr71.net>
| | | | Cc: Denys Vlasenko <dvlasenk@redhat.com>
| | | | Cc: H. Peter Anvin <hpa@zytor.com>
| | | | Cc: Len Brown <lenb@kernel.org>
| | | | Cc: Linus Torvalds <torvalds@linux-foundation.org>
| | | | Cc: Peter Zijlstra <peterz@infradead.org>
| | | | Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | | | Cc: Thomas Gleixner <tglx@linutronix.de>
| | | | Cc: Viresh Kumar <viresh.kumar@linaro.org>
| | | | Cc: jacob.jun.pan@intel.com
| | | | Cc: linux-pm@vger.kernel.org
| | | | Link: http://lkml.kernel.org/r/20160603001945.0F5D02AA@viggo.jf.intel.com
| | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | intel_pstate: Do not clear utilization update hooks on policy changes  Rafael J. Wysocki  2016-06-27  1  -2/+3
|/ / /
| | | intel_pstate_set_policy() is invoked by the cpufreq core during driver initialization, on changes of policy attributes (minimum and maximum frequency, for example) via sysfs and via CPU notifications from the platform firmware. On some platforms the latter may occur relatively often.
| | | Commit bb6ab52f2bef (intel_pstate: Do not set utilization update hook too early) made intel_pstate_set_policy() clear the CPU's utilization update hook before updating the policy attributes for it (and set the hook again after doing that), but that involves invoking synchronize_sched() and adds overhead to the CPU notifications mentioned above and to the sched-RCU handling in general.
| | | That extra overhead is arguably not necessary, because updating policy attributes when the CPU's utilization update hook is active should not lead to any adverse effects, so drop the clearing of the hook from intel_pstate_set_policy() and make it check if the hook has been set already when attempting to set it.
| | | Fixes: bb6ab52f2bef (intel_pstate: Do not set utilization update hook too early)
| | | Reported-by: Jisheng Zhang <jszhang@marvell.com>
| | | Tested-by: Jisheng Zhang <jszhang@marvell.com>
| | | Tested-by: Doug Smythies <dsmythies@telus.net>
| | | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | cpufreq: intel_pstate: Adjust _PSS[0] freqeuency if needed  Srinivas Pandruvada  2016-06-15  1  -20/+2
|/ /
| | The maximum turbo P-State used by the intel_pstate driver may be limited by ACPI _PSS table entry 0. After commit 9522a2ff9cde (cpufreq: intel_pstate: Enforce _PPC limits), the maximum performance on servers will be capped by the _PSS table entry 0 by default. Even though that is formally correct, it may lead to performance regressions in some cases.
| | Namely, if the _PSS table entry 0 is not the maximum turbo P-State, performance measured after commit 9522a2ff9cde will not match the performance measured before that commit on the same system.
| | For this reason, modify the code to always use the maximum turbo frequency as the one that corresponds to _PSS table entry 0 if turbo is enabled in the BIOS. This way, the performance levels from before commit 9522a2ff9cde will be restored on the affected systems.
| | Fixes: 9522a2ff9cde (cpufreq: intel_pstate: Enforce _PPC limits)
| | Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | [ rjw : Changelog ]
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | cpufreq: intel_pstate: Fix ->set_policy() interface for no_turbo  Srinivas Pandruvada  2016-06-08  1  -2/+5
| | When turbo is disabled, the ->set_policy() interface is broken.
| | For example, when turbo is disabled and cpuinfo.max = 2900000 (full max turbo frequency), setting the limits results in a frequency lower than the requested one:
| |   Set 1000000 KHz results in 0700000 KHz
| |   Set 1500000 KHz results in 1100000 KHz
| |   Set 2000000 KHz results in 1500000 KHz
| | This is because the limits->max_perf fraction is calculated using the max turbo frequency as the reference, but when the max P-State is capped in intel_pstate_get_min_max(), the reference is not the max turbo P-State. This results in reducing the max P-State.
| | One option is to always use max turbo as the reference for calculating limits. But this will not be correct. By definition, the intel_pstate sysfs limits show the percentage of available performance. So when the BIOS has disabled turbo, the available performance is max non-turbo, and max_perf_pct should still show 100%.
| | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | [ rjw : Subject & changelog, rewrite in fewer lines of code ]
| | Cc: All applicable <stable@vger.kernel.org>
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | cpufreq: intel_pstate: Fix code ordering in intel_pstate_set_policy()  Srinivas Pandruvada  2016-06-08  1  -1/+4
|/
| The limits->max_perf is rounded up but immediately overwritten by another assignment to limits->max_perf. Move that operation to the correct location.
| While here, also add a pr_debug() call in ->set_policy to aid in debugging.
| Fixes: 785ee2788141 (cpufreq: intel_pstate: Fix limits->max_perf rounding error)
| Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| [ rjw : Subject & changelog ]
| Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* cpufreq: intel_pstate: Downgrade print level for _PPC  Srinivas Pandruvada  2016-05-30  1  -1/+1
| Downgrade pr_info to pr_debug for the "_PPC limits will be enforced" message. In server systems with many cores this message is annoying.
| Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| [ rjw: Changelog ]
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* intel_pstate: Simplify conditional in intel_pstate_set_policy()  Rafael J. Wysocki  2016-05-18  1  -6/+5
| One of the if () statements in intel_pstate_set_policy() causes another if () to be evaluated if the condition is true and it doesn't do anything else, so merge the two if () statements into one.
| No functional changes.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
* intel_pstate: Clean up get_target_pstate_use_performance()  Rafael J. Wysocki  2016-05-11  1  -16/+11
| The comments and the core_busy variable name in get_target_pstate_use_performance() are totally confusing, so modify them to reflect what's going on.
| The results of the computations should be the same as before.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* intel_pstate: Use sample.core_avg_perf in get_avg_pstate()  Rafael J. Wysocki  2016-05-11  1  -2/+2
| Notice that get_avg_pstate() can use sample.core_avg_perf instead of carrying out the same division again, so make it do that.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* intel_pstate: Clarify average performance computation  Rafael J. Wysocki  2016-05-11  1  -16/+24
| The core_pct_busy field of struct sample actually contains the average performance during the last sampling period (in percent) and not the utilization of the core as suggested by its name, which is confusing.
| For this reason, change the name of that field to core_avg_perf and rename the function that computes its value accordingly.
| Also notice that storing this value as percentage requires a costly integer multiplication to be carried out in a hot path, so instead store it as an "extended fixed point" value with more fraction bits and update the code using it accordingly (it is better to change the name of the field along with its meaning in one go than to make those two changes separately, as that would likely lead to more confusion).
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
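The switch from a percentage to an "extended fixed point" value can be illustrated with a short sketch. Everything here is an assumption made for the example: the choice of 14 fraction bits, the function names, and the idea that the ratio comes from two counter deltas; the commit message itself does not give these details.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fraction bits in the fixed-point ratio (assumed for this sketch). */
#define EXT_FRAC_BITS 14

/* Percentage form: needs the extra multiplication by 100 on every sample. */
static uint64_t avg_perf_percent(uint64_t delta_actual, uint64_t delta_reference)
{
    assert(delta_reference != 0);
    return delta_actual * 100 / delta_reference;
}

/* Fixed-point form: a shift replaces the multiplication by 100. */
static uint64_t avg_perf_fp(uint64_t delta_actual, uint64_t delta_reference)
{
    assert(delta_reference != 0);
    return (delta_actual << EXT_FRAC_BITS) / delta_reference;
}

/* Scale a value (for example a frequency) by a fixed-point ratio. */
static uint64_t scale_by_avg_perf(uint64_t value, uint64_t avg_perf)
{
    return (value * avg_perf) >> EXT_FRAC_BITS;
}
```

A ratio of 3/4 is stored as 12288 with 14 fraction bits, and scaling 2000 by it gives 1500, the same result the 75% form would produce but with finer resolution for ratios that are not whole percentages.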
* intel_pstate: Avoid unnecessary synchronize_sched() during initialization  Chen Yu  2016-05-11  1  -0/+9
| Currently, in intel_pstate_clear_update_util_hook(), after clearing the utilization update hook, we leverage synchronize_sched() to deal with synchronization, which is a little bit time-costly because synchronize_sched() has to wait for all the CPUs to go through a grace period.
| Actually, the synchronize_sched() is not necessary if the utilization update hook has not been set for the given CPU yet, so make the driver check if that's the case and avoid the synchronize_sched() call then.
| Link: https://bugzilla.kernel.org/show_bug.cgi?id=116371
| Tested-by: Tian Ye <yex.tian@intel.com>
| Signed-off-by: Chen Yu <yu.c.chen@intel.com>
| [ rjw : Rebase ]
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* intel_pstate: Clean up intel_pstate_get()  Rafael J. Wysocki  2016-05-09  1  -7/+2
| intel_pstate_get() contains a local variable that's initialized but never used and it can be written in fewer lines of code, so clean it up.
| Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
* Merge cpufreq fixes going into v4.6.  Rafael J. Wysocki  2016-05-06  1  -8/+23
|\
| | * pm-cpufreq-fixes:
| |   intel_pstate: Fix intel_pstate_get()
| |   cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
| |   cpufreq: st: enable selective initialization based on the platform
| |   cpufreq: intel_pstate: Fix processing for turbo activation ratio
| * intel_pstate: Fix intel_pstate_get()  Rafael J. Wysocki  2016-05-04  1  -6/+8
| | After commit 8fa520af5081 "intel_pstate: Remove freq calculation from intel_pstate_calc_busy()" intel_pstate_get() calls get_avg_frequency() to compute the average frequency, which is problematic for two reasons.
| | First, intel_pstate_get() may be invoked before the driver reads the CPU feedback registers for the first time and if that happens, get_avg_frequency() will attempt to divide by zero.
| | Second, the get_avg_frequency() call in intel_pstate_get() is racy with respect to intel_pstate_sample() and it may end up returning completely meaningless values for this reason.
| | Moreover, after commit 7349ec0470b6 "intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()" sample.core_pct_busy is never computed on Atom, but it is used in intel_pstate_adjust_busy_pstate() in that case too.
| | To address those problems notice that if sample.core_pct_busy was used in the average frequency computation carried out by get_avg_frequency(), both the divide by zero problem and the race with respect to intel_pstate_sample() would be avoided.
| | Accordingly, move the invocation of intel_pstate_calc_busy() from get_target_pstate_use_performance() to intel_pstate_update_util(), which also will take care of the uninitialized sample.core_pct_busy on Atom, and modify get_avg_frequency() to use sample.core_pct_busy as per the above.
| | Reported-by: kernel test robot <ying.huang@linux.intel.com>
| | Link: http://marc.info/?l=linux-kernel&m=146226437623173&w=4
| | Fixes: 8fa520af5081 "intel_pstate: Remove freq calculation from intel_pstate_calc_busy()"
| | Fixes: 7349ec0470b6 "intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()"
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
| * cpufreq: intel_pstate: Fix HWP on boot CPU after system resume  Rafael J. Wysocki  2016-05-02  1  -2/+10
| | Commit 41cfd64cf49fc "Update frequencies of policy->cpus only from ->set_policy()" changed the way the intel_pstate driver's ->set_policy callback updates the HWP (hardware-managed P-states) settings. A side effect of it is that if those settings are modified on the boot CPU during system suspend and wakeup, they will never be restored during subsequent system resume. To address this problem, allow cpufreq drivers that don't provide ->target or ->target_index callbacks to use ->suspend and ->resume callbacks and add a ->resume callback to intel_pstate to restore the HWP settings on the CPUs that belong to the given policy. Fixes: 41cfd64cf49fc "Update frequencies of policy->cpus only from ->set_policy()" Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
| * cpufreq: intel_pstate: Fix processing for turbo activation ratio  Srinivas Pandruvada  2016-04-25  1  -0/+5
| | When the config TDP level is not nominal (nominal being level 0), the MSR values for reading level 1 and level 2 ratios contain power in the low 14 bits and the actual ratio at bits [23:16]. The current processing for level 1 and level 2 is wrong, as no shift is done to get the actual ratio. Fixes: 6a35fc2d6c22 (cpufreq: intel_pstate: get P1 from TAR when available) Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: 4.4+ <stable@vger.kernel.org> # 4.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
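The missing shift described in the turbo activation ratio fix above can be sketched as a bit-field extraction. The helper names msr_field() and tdp_level_ratio() are hypothetical and do not come from the driver; only the field position [23:16] is taken from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return bits [hi:lo] of a 64-bit MSR value. */
static uint64_t msr_field(uint64_t value, unsigned int hi, unsigned int lo)
{
    assert(hi >= lo && hi < 64);
    unsigned int width = hi - lo + 1;
    uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);

    return (value >> lo) & mask;
}

/* Ratio for a config TDP level 1 or level 2 MSR value: the low bits
 * carry a power value, so the ratio must be shifted down from [23:16]. */
static unsigned int tdp_level_ratio(uint64_t msr_value)
{
    return (unsigned int)msr_field(msr_value, 23, 16);
}
```

With a raw value of 0x0b1234, tdp_level_ratio() yields 0x0b, whereas reading the low byte without the shift would yield 0x34, which is part of the power field.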
* | cpufreq: intel_pstate: Ignore _PPC processing under HWP  Srinivas Pandruvada  2016-05-05  1  -0/+3
| | When the HWP (hardware P-states) feature is active, ACPI _PSS and _PPC are not used, so ignore processing for _PPC limits. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | cpufreq: intel_pstate: Enable PPC enforcement for servers  Srinivas Pandruvada  2016-04-28  1  -1/+11
| | For platforms which are controlled via a remote node manager, enable _PPC by default. These platforms are mostly categorized as enterprise servers or performance servers. They need to pass certification tests that exercise control via _PPC. The relative risk of enabling by default is low, as these systems are less likely to have a broken _PSS table. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | cpufreq: intel_pstate: Adjust policy->max  Srinivas Pandruvada  2016-04-28  1  -0/+11
| | When policy->max is changed via _PPC or sysfs and is more than the max non turbo frequency, it does not really change the resulting performance in some processors. When policy->max results in a P-State ratio above the turbo activation ratio, the processor can choose any P-State up to max turbo. So the user or _PPC setting has no value, but it can cause undesirable side effects like:
| | - Showing reduced max percentage in Intel P-State sysfs
| | - It can cause reduced max performance under certain boundary conditions: The requested max scaling frequency, either via _PPC or via cpufreq-sysfs, is converted into a fixed-point max percent scale. In the majority of cases this results in the correct max, but not 100% of the time. If the _PPC is requested at a point where the calculation leads to a lower max, this can result in a lower P-State than expected and it will impact performance.
| |
| | Example of this condition using a Broadwell laptop with config TDP.
| |
| | ACPI _PSS table from a Broadwell laptop:
| |   2301000 2300000 2200000 2000000 1900000 1800000 1700000 1500000
| |   1400000 1300000 1100000 1000000  900000  800000  600000  500000
| |
| | The actual results by disabling config TDP so that we can get what is requested on or below 2300000 kHz:
| |
| | scaling_max_freq    Max Requested P-State (hex)    Resultant scaling max
| | ----------------    ---------------------------    ---------------------
| | 2400000             18                             2900000 (max turbo)
| | 2300000             17                             2300000 (max physical non turbo)
| | 2200000             15                             2100000
| | 2100000             15                             2100000
| | 2000000             13                             1900000
| | 1900000             13                             1900000
| | 1800000             12                             1800000
| | 1700000             11                             1700000
| | 1600000             10                             1600000
| | 1500000             f                              1500000
| | 1400000             e                              1400000
| | 1300000             d                              1300000
| | 1200000             c                              1200000
| | 1100000             a                              1000000
| | 1000000             a                              1000000
| | 900000              9                              900000
| | 800000              8                              800000
| | 700000              7                              700000
| | 600000              6                              600000
| | 500000              5                              500000
| | -------------------------------------------------------------------------
| |
| | Now set the config TDP level 1 ratio to 0x0b (equivalent to 1100000 kHz) in BIOS (not every system will let you adjust this). The turbo activation ratio will be set to one less than that, which is 0x0a (so any request above 1000000 kHz should result in the turbo region, assuming no thermal limits). Here _PPC will request a max of 1100000 kHz (which should still result in turbo, as this is more than the turbo activation ratio, up to the max allowable turbo frequency), but the actual calculation resulted in a max ceiling P-State of 0x0a. So under any load condition, this driver will not request turbo P-States. This will be a huge performance hit.
| |
| | When the config TDP feature is ON, if the _PPC points to a frequency above the turbo activation ratio, the performance can still reach max turbo. In this case we don't need to treat this as a reduced frequency in the set_policy callback. In this change, when config TDP is active (detected by checking whether the physical max non turbo ratio is more than the current max non turbo ratio), any request above the current max non turbo is treated as full performance.
| |
| | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | [ rjw : Minor cleanups ]
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
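The rounding effect behind the table in the policy->max commit above can be reproduced with a short model. This is a sketch under stated assumptions, not the driver's code: it assumes 8 fractional bits, a percentage that is rounded up, a max turbo ratio of 29 (0x1d, matching the 2900000 max turbo in the table), and the hypothetical function name max_pstate_for(). Under those assumptions the model yields the same "Max Requested P-State" values as the table, including 0x0a for a 1100000 kHz request.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8 /* assumed number of fractional bits */

static int64_t int_tofp(int64_t x) { return x << FRAC_BITS; }
static int64_t fp_toint(int64_t x) { return x >> FRAC_BITS; }
static int64_t mul_fp(int64_t a, int64_t b) { return (a * b) >> FRAC_BITS; }
static int64_t div_fp(int64_t a, int64_t b) { return (a << FRAC_BITS) / b; }

/* Hypothetical model: map a requested max frequency to a ceiling P-State.
 * The request becomes an integer percentage of max turbo (rounded up),
 * then a fixed-point scale, and the scale is applied to the turbo ratio.
 * Each truncation step can push the result below the requested ratio. */
static int max_pstate_for(int64_t req_khz, int64_t turbo_khz, int turbo_ratio)
{
    assert(turbo_khz > 0);

    int64_t pct = (req_khz * 100 + turbo_khz - 1) / turbo_khz;
    int64_t scale = div_fp(int_tofp(pct), int_tofp(100));

    return (int)fp_toint(mul_fp(int_tofp(turbo_ratio), scale));
}
```

For example, max_pstate_for(1100000, 2900000, 29) gives 38 percent, a scale of 97/256, and 29 * 97 / 256 truncated to 10 (0x0a), one step below the requested ratio 0x0b. A request of 2200000 similarly lands on 0x15 instead of 0x16.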
* | cpufreq: intel_pstate: Enforce _PPC limits  Srinivas Pandruvada  2016-04-28  1  -2/+134
| | Use ACPI _PPC notification to limit the max P-State the driver will request. An ACPI _PPC change notification is sent by BIOS to limit the max P-State in several cases:
| | - Reduce the impact of a platform thermal condition
| | - When the Config TDP feature is used, a changed _PPC is sent to follow the TDP change
| | - Remote node managers in servers want to control platform power via the baseboard management controller (BMC)
| |
| | This change registers with the ACPI processor performance lib so that _PPC changes are notified to the cpufreq core, which in turn results in a call to the .setpolicy() callback. Also, the way the _PSS table identifies a turbo frequency is not compatible with the max turbo frequency in intel_pstate, so the very first entry in _PSS needs to be adjusted.
| |
| | This feature can be turned on by using the kernel parameter: intel_pstate=support_acpi_ppc
| |
| | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | [ rjw: Minor cleanups ]
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | cpufreq: intel_pstate: Use average P-State instead of current P-State  Philippe Longepe  2016-04-25  1  -1/+7
| | The result returned by pid_calc() is subtracted from current_pstate (which is the P-State requested during the last period) in order to obtain the target P-State for the current iteration. However, current_pstate may not reflect the real current P-State of the CPU. In particular, that P-State may be higher because of the frequency sharing per module.
| |
| | The theory is:
| | - The load is the percentage of time spent in C0 and is related to the average P-State during the same period.
| | - The last requested P-State can be completely different from the average P-State (because of frequency sharing or throttling).
| | - The P-State shift computed by pid_calc() is based on the load computed at the average P-State, so the shift must be relative to this average P-State.
| |
| | Using the average P-State instead of the current P-State improves power without a significant performance penalty in cases when a task migrates from one core to another core sharing frequency and voltage.
| |
| | Performance and power comparison with this patch on the Cherry Trail platform using Android:
| |
| | Benchmark                  ΔPerf     ΔPower
| | FishTank                   10.45%    3.1%
| | SmartBench-Gaming          -0.1%     -10.4%
| | SmartBench-Productivity    -0.8%     -10.4%
| | CandyCrush                 n/a       -17.4%
| | AngryBirds                 n/a       -5.9%
| | videoPlayback              n/a       -13.9%
| | audioPlayback              n/a       -4.9%
| | IcyRocks-20-50             0.0%      -38.4%
| | iozone RR                  -0.16%    -1.3%
| | iozone RW                  0.74%     -1.3%
| |
| | Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
| | Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
| | Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
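The reasoning in the average P-State commit above can be sketched as follows. The function names avg_pstate() and next_pstate() are hypothetical, and computing the average from APERF/MPERF counter deltas is an assumption made for this sketch rather than something stated in the commit message. What the sketch takes from the message is that the PID result is subtracted from the average P-State of the last period instead of from the last requested P-State.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: average P-State over the last sample period, assuming
 * it is estimated as max_pstate * (APERF delta / MPERF delta). */
static int avg_pstate(int max_pstate_physical,
                      uint64_t aperf_delta, uint64_t mperf_delta)
{
    assert(mperf_delta != 0);
    return (int)((uint64_t)max_pstate_physical * aperf_delta / mperf_delta);
}

/* Target P-State: the PID correction is applied relative to the average
 * P-State, then clamped to the allowed range. */
static int next_pstate(int avg, int pid_correction,
                       int min_pstate, int max_pstate)
{
    int target = avg - pid_correction;

    if (target < min_pstate)
        target = min_pstate;
    if (target > max_pstate)
        target = max_pstate;
    return target;
}
```

If the last request was P-State 16 but the core actually averaged 8 (for instance because a sibling core on the shared module held the frequency down), a PID correction of -2 gives a target of 10 rather than 18, so the next request stays close to what the load was measured against.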