Pull x86 microcode loader updates from Ingo Molnar:
"There are two main changes in this tree:
- AMD microcode early loading fixes
- some microcode loader source files reorganization"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode: Move to a proper location
x86, microcode, AMD: Fix early ucode loading
x86, microcode: Share native MSR accessing variants
x86, ramdisk: Export relocated ramdisk VA
Pull x86 EFI changes from Ingo Molnar:
"This consists of two main parts:
- New static EFI runtime services virtual mapping layout which is
groundwork for kexec support on EFI (Borislav Petkov)
- EFI kexec support itself (Dave Young)"
* 'x86-efi-kexec-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/efi: parse_efi_setup() build fix
x86: ksysfs.c build fix
x86/efi: Delete superfluous global variables
x86: Reserve setup_data ranges late after parsing memmap cmdline
x86: Export x86 boot_params to sysfs
x86: Add xloadflags bit for EFI runtime support on kexec
x86/efi: Pass necessary EFI data for kexec via setup_data
efi: Export EFI runtime memory mapping to sysfs
efi: Export more EFI table variables to sysfs
x86/efi: Cleanup efi_enter_virtual_mode() function
x86/efi: Fix off-by-one bug in EFI Boot Services reservation
x86/efi: Add a wrapper function efi_map_region_fixed()
x86/efi: Remove unused variables in __map_region()
x86/efi: Check krealloc return value
x86/efi: Runtime services virtual mapping
x86/mm/cpa: Map in an arbitrary pgd
x86/mm/pageattr: Add last levels of error path
x86/mm/pageattr: Add a PUD error unwinding path
x86/mm/pageattr: Add a PTE pagetable populating function
x86/mm/pageattr: Add a PMD pagetable populating function
...
Pull x86 TLB detection update from Ingo Molnar:
"A single change that extends our TLB cache size detection+reporting
code"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpu: Detect more TLB configuration
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpu, amd: Fix a shadowed variable situation
um, x86: Fix vDSO build
x86: Delete non-required instances of include <linux/init.h>
x86, realmode: Pointer walk cleanups, pull out invariant use of __pa()
x86/traps: Clean up error exception handler definitions
Pull scheduler changes from Ingo Molnar:
- Add the initial implementation of SCHED_DEADLINE support: a real-time
scheduling policy where tasks that meet their deadlines and
periodically execute their instances in less than their runtime quota
see real-time scheduling and won't miss any of their deadlines.
Tasks that go over their quota get delayed (Available to privileged
users for now)
- Clean up and fix preempt_enable_no_resched() abuse all around the
tree
- Do sched_clock() performance optimizations on x86 and elsewhere
- Fix and improve auto-NUMA balancing
- Fix and clean up the idle loop
- Apply various cleanups and fixes
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
sched: Fix __sched_setscheduler() nice test
sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
sched: Fix up attr::sched_priority warning
sched: Fix up scheduler syscall LTP fails
sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
sched/core: Fix htmldocs warnings
sched/deadline: No need to check p if dl_se is valid
sched/deadline: Remove unused variables
sched/deadline: Fix sparse static warnings
m68k: Fix build warning in mac_via.h
sched, thermal: Clean up preempt_enable_no_resched() abuse
sched, net: Fixup busy_loop_us_clock()
sched, net: Clean up preempt_enable_no_resched() abuse
sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
sched/preempt, locking: Rework local_bh_{dis,en}able()
sched/clock, x86: Avoid a runtime condition in native_sched_clock()
sched/clock: Fix up clear_sched_clock_stable()
sched/clock, x86: Use a static_key for sched_clock_stable
sched/clock: Remove local_irq_disable() from the clocks
sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
...
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Add Intel RAPL energy counter support (Stephane Eranian)
- Clean up uprobes (Oleg Nesterov)
- Optimize ring-buffer writes (Peter Zijlstra)
Tooling side changes, user visible:
- 'perf diff':
- Add column colouring improvements (Ramkumar Ramachandra)
- 'perf kvm':
- Add guest related improvements, including allowing to specify a
directory with guest specific /proc information (Dongsheng Yang)
- Add shell completion support (Ramkumar Ramachandra)
- Add '-v' option (Dongsheng Yang)
- Support --guestmount (Dongsheng Yang)
- 'perf probe':
- Support showing source code, asking for variables to be collected
at probe time and other 'perf probe' operations that use DWARF
information.
This supports only binaries with debugging information at this
time, detached debuginfo (aka debuginfo packages) support should
come in later patches (Masami Hiramatsu)
- 'perf record':
- Rename --no-delay option to --no-buffering, better reflecting its
purpose and freeing up '--delay' to take the place of
'--initial-delay', so that 'record' and 'stat' are consistent
(Arnaldo Carvalho de Melo)
- Default the -t/--thread option to no inheritance (Adrian Hunter)
- Make per-cpu mmaps the default (Adrian Hunter)
- 'perf report':
- Improve callchain processing performance (Frederic Weisbecker)
- Retain bfd reference to lookup source line numbers, greatly
optimizing, among other use cases, 'perf report -s srcline'
(Adrian Hunter)
- Improve callchain processing performance even more (Namhyung Kim)
- Add a perf.data file header window in the 'perf report' TUI,
associated with the 'i' hotkey, providing a counterpart to the
--header option in the stdio UI (Namhyung Kim)
- 'perf script':
- Add an option in 'perf script' to print the source line number
(Adrian Hunter)
- Add --header/--header-only options to 'script' and 'report', the
default is not tho show the header info, but as this has been the
default for some time, leave a single line explaining how to
obtain that information (Jiri Olsa)
- Add options to show comm, fork, exit and mmap PERF_RECORD_ events
(Namhyung Kim)
- Print callchains and symbols if they exist (David Ahern)
- 'perf timechart'
- Add backtrace support to CPU info
- Print pid along the name
- Add support for CPU topology
- Add new option --highlight'ing threads, be it by name or, if a
numeric value is provided, that run more than given duration
(Stanislav Fomichev)
- 'perf top':
- Make 'perf top -g' refer to callchains, for consistency with
other tools (David Ahern)
- 'perf trace':
- Handle old kernels where the "raw_syscalls" tracepoints were
called plain "syscalls" (David Ahern)
- Remove thread summary coloring, by Pekka Enberg.
- Honour -m option in 'trace', the tool was offering the option to
set the mmap size, but wasn't using it when doing the actual mmap
on the events file descriptors (Jiri Olsa)
- generic:
- Backport libtraceevent plugin support (trace-cmd repository, with
plugins for jbd2, hrtimer, kmem, kvm, mac80211, sched_switch,
function, xen, scsi, cfg80211 (Jiri Olsa)
- Print session information only if --stdio is given (Namhyung Kim)
Tooling side changes, developer visible (plumbing):
- Improve 'perf probe' exit path, release resources (Masami
Hiramatsu)
- Improve libtraceevent plugins exit path, allowing the registering
of an unregister handler to be called at exit time (Namhyung Kim)
- Add an alias to the build test makefile (make -C tools/perf
build-test) (Namhyung Kim)
- Get rid of die() and friends (good riddance!) in libtraceevent
(Namhyung Kim)
- Fix cross build problems related to pkgconfig and CROSS_COMPILE not
being propagated to the feature tests, leading to features being
tested in the host and then being enabled on the target (Mark
Rutland)
- Improve forked workload error reporting by sending the errno in the
signal data queueing integer field, using sigqueue and by doing the
signal setup in the evlist methods, removing open coded equivalents
in various tools (Arnaldo Carvalho de Melo)
- Do more auto exit cleanup chores in the 'evlist' destructor, so
that the tools don't have to all do that sequence (Arnaldo Carvalho
de Melo)
- Pack 'struct perf_session_env' and 'struct trace' (Arnaldo Carvalho
de Melo)
- Add test for building detached source tarballs (Arnaldo Carvalho de
Melo)
- Move some header files (tools/perf/ to tools/include/ to make them
available to other tools/ dwelling codebases (Namhyung Kim)
- Move logic to warn about kptr_restrict'ed kernels to separate
function in 'report' (Arnaldo Carvalho de Melo)
- Move hist browser selection code to separate function (Arnaldo
Carvalho de Melo)
- Move histogram entries collapsing to separate function (Arnaldo
Carvalho de Melo)
- Introduce evlist__for_each() & friends (Arnaldo Carvalho de Melo)
- Automate setup of FEATURE_CHECK_(C|LD)FLAGS-all variables (Jiri
Olsa)
- Move arch setup into seprate Makefile (Jiri Olsa)
- Make libtraceevent install target quieter (Jiri Olsa)
- Make tests/make output more compact (Jiri Olsa)
- Ignore generated files in feature-checks (Chunwei Chen)
- Introduce pevent_filter_strerror() in libtraceevent, similar in
purpose to libc's strerror() function (Namhyung Kim)
- Use perf_data_file methods to write output file in 'record' and
'inject' (Jiri Olsa)
- Use pr_*() functions where applicable in 'report' (Namhyumg Kim)
- Add 'machine' 'addr_location' struct to have full picture (machine,
thread, map, symbol, addr) for a (partially) resolved address,
reducing function signatures (Arnaldo Carvalho de Melo)
- Reduce code duplication in the histogram entry creation/insertion
(Arnaldo Carvalho de Melo)
- Auto allocate annotation histogram data structures (Arnaldo
Carvalho de Melo)
- No need to test against NULL before calling free, also set freed
memory in struct pointers to NULL, to help fixing use after free
bugs (Arnaldo Carvalho de Melo)
- Rename some struct DSO binary_type related members and methods, to
clarify its purpose and need for differentiation (symtab_type, ie
one is about the files .text, CFI, etc, i.e. its binary contents,
and the other is about where the symbol table came from (Arnaldo
Carvalho de Melo)
- Convert to new topic libraries, starting with an API one (sysfs,
debugfs, etc), renaming liblk in the process (Borislav Petkov)
- Get rid of some more panic() like error handling in libtraceevent.
(Namhyung Kim)
- Get rid of panic() like calls in libtraceevent (Namyung Kim)
- Start carving out symbol parsing routines (perf, just moving
routines to topic files in tools/lib/symbol/, tools that want to
use it need to integrate it directly, ie no
tools/lib/symbol/Makefile is provided (Arnaldo Carvalho de Melo)
- Assorted refactoring patches, moving code around and adding utility
evlist methods that will be used in the IPT patchset (Adrian
Hunter)
- Assorted mmap_pages handling fixes (Adrian Hunter)
- Several man pages typo fixes (Dongsheng Yang)
- Get rid of several die() calls in libtraceevent (Namhyung Kim)
- Use basename() in a more robust way, to avoid problems related to
different system library implementations for that function
(Stephane Eranian)
- Remove open coded management of short_name_allocated member (Adrian
Hunter)
- Several cleanups in the "dso" methods, constifying some parameters
and renaming some fields to clarify its purpose (Arnaldo Carvalho
de Melo)
- Add per-feature check flags, fixing libunwind related build
problems on some architectures (Jean Pihet)
- Do not disable source line lookup just because of one failure.
(Adrian Hunter)
- Several 'perf kvm' man page corrections (Dongsheng Yang)
- Correct the message in feature-libnuma checking, swowing the right
devel package names for various distros (Dongsheng Yang)
- Polish 'readn()' function and introduce its counterpart,
'writen()' (Jiri Olsa)
- Start moving timechart state from global variables to a 'perf_tool'
derived 'timechart' struct (Arnaldo Carvalho de Melo)
... and lots of fixes and improvements I forgot to list"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
perf tools: Remove unnecessary callchain cursor state restore on unmatch
perf callchain: Spare double comparison of callchain first entry
perf tools: Do proper comm override error handling
perf symbols: Export elf_section_by_name and reuse
perf probe: Release all dynamically allocated parameters
perf probe: Release allocated probe_trace_event if failed
perf tools: Add 'build-test' make target
tools lib traceevent: Unregister handler when xen plugin is unloaded
tools lib traceevent: Unregister handler when scsi plugin is unloaded
tools lib traceevent: Unregister handler when jbd2 plugin is is unloaded
tools lib traceevent: Unregister handler when cfg80211 plugin is unloaded
tools lib traceevent: Unregister handler when mac80211 plugin is unloaded
tools lib traceevent: Unregister handler when sched_switch plugin is unloaded
tools lib traceevent: Unregister handler when kvm plugin is unloaded
tools lib traceevent: Unregister handler when kmem plugin is unloaded
tools lib traceevent: Unregister handler when hrtimer plugin is unloaded
tools lib traceevent: Unregister handler when function plugin is unloaded
tools lib traceevent: Add pevent_unregister_print_function()
tools lib traceevent: Add pevent_unregister_event_handler()
tools lib traceevent: fix pointer-integer size mismatch
...
Pull IRQ changes from Ingo Molnar:
"The only change in this cycle is a CPU hotplug related spurious
warning fix"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Fix kbuild warning in smp_irq_move_cleanup_interrupt()
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs
On AMD family 10h we see following error messages while waking up from
S3 for all non-boot CPUs leading to a failed IBS initialization:
Enabling non-boot CPUs ...
smpboot: Booting Node 0 Processor 1 APIC 0x1
[Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu
perf: IBS APIC setup failed on cpu #1
process: Switch to broadcast mode on CPU1
CPU1 is up
...
ACPI: Waking up from system sleep state S3
Reason for this is that during suspend the LVT offset for the IBS
vector gets lost and needs to be reinialized while resuming.
The offset is read from the IBSCTL msr. On family 10h the offset needs
to be 1 as offset 0 is used for the MCE threshold interrupt, but
firmware assings it for IBS to 0 too. The kernel needs to reprogram
the vector. The msr is a readonly node msr, but a new value can be
written via pci config space access. The reinitialization is
implemented for family 10h in setup_ibs_ctl() which is forced during
IBS setup.
This patch fixes IBS setup after waking up from S3 by adding
resume/supend hooks for the boot cpu which does the offset
reinitialization.
Marking it as stable to let distros pick up this fix.
Signed-off-by: Robert Richter <rric@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> v3.2..
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make disabled_cpu_apicid static and read_mostly, and fix a couple of
typos.
Reported-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/20140115182511.GA22737@gmail.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Add disable_cpu_apicid kernel parameter. To use this kernel parameter,
specify an initial APIC ID of the corresponding CPU you want to
disable.
This is mostly used for the kdump 2nd kernel to disable BSP to wake up
multiple CPUs without causing system reset or hang due to sending INIT
from AP to BSP.
Kdump users first figure out initial APIC ID of the BSP, CPU0 in the
1st kernel, for example from /proc/cpuinfo and then set up this kernel
parameter for the 2nd kernel using the obtained APIC ID.
However, doing this procedure at each boot time manually is awkward,
which should be automatically done by user-land service scripts, for
example, kexec-tools on fedora/RHEL distributions.
This design is more flexible than disabling BSP in kernel boot time
automatically in that in kernel boot time we have no choice but
referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning
that the method is not applicable to the systems without such BIOS
tables.
One assumption behind this design is that users get initial APIC ID of
the BSP in still healthy state and so BSP is uniquely kept in
CPU0. Thus, through the kernel parameter, only one initial APIC ID can
be specified.
In a comparison with disabled_cpu_apicid, we use read_apic_id(), not
boot_cpu_physical_apicid, because on some platforms, the variable is
modified to the apicid reported as BSP through MP table and this
function is executed with the temporarily modified
boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel
parameter doesn't work well for apicids of APs.
Fixing the wrong handling of boot_cpu_physical_apicid requires some
reviews and tests beyond some platforms and it could take some
time. The fix here is a kind of workaround to focus on the main topic
of this patch.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/20140115064458.1545.38775.stgit@localhost6.localdomain6
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Having u32 and struct cpuinfo_x86 * by the same name is not very smart,
although it was ok in this case due to the limited scope of u32 c and it
being used only once in there.
Fix this.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1389786735-16751-1-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Currently we do a read, a dummy write and a final read to fetch
the error code. The value from the final read is taken.
This is not the recommended way and leads to corrupted/lost ESR
values.
Intel(c) 64 and IA-32 Architectures Software Developer's Manual,
Combined Volumes 1, 2ABC, 3ABC, Section 10.5.3 states:
Before attempt to read from the ESR, software should first
write to it. (The value written does not affect the values read
subsequently; only zero may be written in x2APIC mode.) This
write clears any previously logged errors and updates the ESR
with any errors detected since the last write to the ESR.
This write also rearms the APIC error interrupt triggering
mechanism.
This patch removes the first read such that we are conform with
the manual.
On my (very old) Pentium MMX SMP system this patch fixes the
issue that APIC errors:
a) are not always reported and
b) are reported with false error numbers.
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: seiji.aguchi@hds.com
Cc: rientjes@google.com
Cc: konrad.wilk@oracle.com
Cc: bp@alien8.de
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1389685487-20872-1-git-send-email-richard@nod.at
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We've grown a bunch of microcode loader files all prefixed with
"microcode_". They should be under cpu/ because this is strictly
CPU-related functionality so do that and drop the prefix since they're
in their own directory now which gives that prefix. :)
While at it, drop MICROCODE_INTEL_LIB config item and stash the
functionality under CONFIG_MICROCODE_INTEL as it was its only user.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
The original idea to use the microcode cache for the APs doesn't pan out
because we do memory allocation there very early and with IRQs disabled
and we don't want to involve GFP_ATOMIC allocations. Not if it can be
helped.
Thus, extend the caching of the BSP patch approach to the APs and
iterate over the ucode in the initrd instead of using the cache. We
still save the relevant patches to it but later, right before we
jettison the initrd.
While at it, fix early ucode loading on 32-bit too.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
We want to use those in AMD's early loading path too. Also, add a
native_wrmsrl variant.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
The ramdisk can possibly get relocated if the whole image is not mapped.
And since we're going over it in the microcode loader and fishing out
the relevant microcode patches, we want access it at its new location.
Thus, export it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Use a ring-buffer like multi-version object structure which allows
always having a coherent object; we use this to avoid having to
disable IRQs while reading sched_clock() and avoids a problem when
getting an NMI while changing the cyc2ns data.
MAINLINE PRE POST
sched_clock_stable: 1 1 1
(cold) sched_clock: 329841 331312 257223
(cold) local_clock: 301773 310296 309889
(warm) sched_clock: 38375 38247 25280
(warm) local_clock: 100371 102713 85268
(warm) rdtsc: 27340 27289 24247
sched_clock_stable: 0 0 0
(cold) sched_clock: 382634 372706 301224
(cold) local_clock: 396890 399275 399870
(warm) sched_clock: 38194 38124 25630
(warm) local_clock: 143452 148698 129629
(warm) rdtsc: 27345 27365 24307
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-s567in1e5ekq2nlyhn8f987r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fengguang Wu's 0day kernel build service reported the following build warning:
arch/x86/kernel/apic/io_apic.c:2211
smp_irq_move_cleanup_interrupt() warn: always true condition '(irq <= -1) => (0-u32max <= (-1))'
because irq is defined as an unsigned int instead of an int.
Fix this trivial error by redefining irq as a signed int. The
remaining consumers of the int are okay.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Link: http://lkml.kernel.org/r/1389620420-7110-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are no __cycles_2_ns() users outside of arch/x86/kernel/tsc.c,
so move it there.
There are no cycles_2_ns() users.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-01lslnavfgo3kmbo4532zlcj@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
During heavy CPU-hotplug operations the following spurious kernel warnings
can trigger:
do_IRQ: No ... irq handler for vector (irq -1)
[ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ]
When downing a cpu it is possible that there are unhandled irqs
left in the APIC IRR register. The following code path shows
how the problem can occur:
1. CPU 5 is to go down.
2. cpu_disable() on CPU 5 executes with interrupt flag cleared
by local_irq_save() via stop_machine().
3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because
interrupt flag is cleared (CPU unabled to handle the irq)
4. IRQs are migrated off of CPU 5, and the vectors' irqs are set
to -1. 5. stop_machine() finishes cpu_disable()
6. cpu_die() for CPU 5 executes in normal context.
7. CPU 5 attempts to handle IRQ 12 because the IRR is set for
IRQ 12. The code attempts to find the vector's IRQ and cannot
because it has been set to -1. 8. do_IRQ() warning displays
warning about CPU 5 IRQ 12.
I added a debug printk to output which CPU & vector was
retriggered and discovered that that we are getting bogus
events. I see a 100% correlation between this debug printk in
fixup_irqs() and the do_IRQ() warning.
This patchset resolves this by adding definitions for
VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying
the code to use them.
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Rui Wang <rui.y.wang@intel.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: janet.morgan@Intel.com
Cc: tony.luck@Intel.com
Cc: ruiv.wang@gmail.com
Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com
[ Cleaned up the code a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds support for the Intel RAPL energy counter
PP1 (Power Plane 1).
On client processors, it usually corresponds to the
energy consumption of the builtin graphic card. That
is why the sysfs event is called energy-gpu.
New event:
- name: power/energy-gpu/
- code: event=0x4
- unit: 2^-32 Joules
On processors without graphics, this should count 0.
The patch only enables this event on client processors.
Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Cc: vincent.weaver@maine.edu
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389176153-3128-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Function tracing callbacks expect to have the ftrace_ops that registered it
passed to them, not the address of the variable that holds the ftrace_ops
that registered it.
Use a mov instead of a lea to store the ftrace_ops into the parameter
of the function tracing callback.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20131113152004.459787f9@gandalf.local.home
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.8+
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.
[ hpa: undid incorrect removal from arch/x86/kernel/head_32.S ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1389054026-12947-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
kbuild test robot report below error for randconfig:
arch/x86/kernel/ksysfs.c: In function 'get_setup_data_paddr':
arch/x86/kernel/ksysfs.c:81:3: error: implicit declaration of function 'ioremap_cache' [-Werror=implicit-function-declaration]
arch/x86/kernel/ksysfs.c:86:3: error: implicit declaration of function 'iounmap' [-Werror=implicit-function-declaration]
Fix it by including <asm/io.h> in ksysfs.c
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Pull x86 fixes from Peter Anvin:
"There is a small EFI fix and a big power regression fix in this batch.
My queue also had a fix for downing a CPU when there are insufficient
number of IRQ vectors available, but I'm holding that one for now due
to recent bug reports"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Don't select EFI from certain special ACPI drivers
x86 idle: Repair large-server 50-watt idle-power regression
Currently e820_reserve_setup_data() is called before parsing early
params, it works in normal case. But for memmap=exactmap, the final
memory ranges are created after parsing memmap= cmdline params, so the
previous e820_reserve_setup_data() has no effect. For example,
setup_data ranges will still be marked as normal system ram, thus when
later sysfs driver ioremap them kernel will warn about mapping normal
ram.
This patch fix it by moving the e820_reserve_setup_data() callback after
parsing early params so they can be set as reserved ranges and later
ioremap will be fine with it.
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
kexec-tools use boot_params for getting the 1st kernel hardware_subarch,
the kexec kernel EFI runtime support also needs to read the old efi_info
from boot_params. Currently it exists in debugfs which is not a good
place for such infomation. Per HPA, we should avoid "sploit debugfs".
In this patch /sys/kernel/boot_params are exported, also the setup_data is
exported as a subdirectory. kexec-tools is using debugfs for hardware_subarch
for a long time now so we're not removing it yet.
Structure is like below:
/sys/kernel/boot_params
|__ data /* boot_params in binary*/
|__ setup_data
| |__ 0 /* the first setup_data node */
| | |__ data /* setup_data node 0 in binary*/
| | |__ type /* setup_data type of setup_data node 0, hex string */
[snip]
|__ version /* boot protocal version (in hex, "0x" prefixed)*/
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Add a new setup_data type SETUP_EFI for kexec use. Passing the saved
fw_vendor, runtime, config tables and EFI runtime mappings.
When entering virtual mode, directly mapping the EFI runtime regions
which we passed in previously. And skip the step to call
SetVirtualAddressMap().
Specially for HP z420 workstation we need save the smbios physical
address. The kernel boot sequence proceeds in the following order.
Step 2 requires efi.smbios to be the physical address. However, I found
that on HP z420 EFI system table has a virtual address of SMBIOS in step
1. Hence, we need set it back to the physical address with the smbios
in efi_setup_data. (When it is still the physical address, it simply
sets the same value.)
1. efi_init() - Set efi.smbios from EFI system table
2. dmi_scan_machine() - Temporary map efi.smbios to access SMBIOS table
3. efi_enter_virtual_mode() - Map EFI ranges
Tested on ovmf+qemu, lenovo thinkpad, a dell laptop and an
HP z420 workstation.
Signed-off-by: Dave Young <dyoung@redhat.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
For consistency with mwait_idle_with_hints(). Not sure they help, but
they really won't hurt...
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com
People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.
This leaves mwait_play_dead() the sole remaining user of __mwait() and
new __mwait() users are probably doing it wrong.
Also remove __sti_mwait() as its unused.
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jacob Jun Pan <jacob.jun.pan@linux.intel.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Len Brown <lenb@kernel.org>
Cc: Rui Zhang <rui.zhang@intel.com>
Acked-by: Rafael Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131212141654.616820819@infradead.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Linux 3.10 changed the timing of how thread_info->flags is touched:
x86: Use generic idle loop
(7d1a941731)
This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.
Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.
Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.
While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.
Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
So I was reading the exception handler generation code and got a real
headache looking at the unstructured mess that our DO_ERROR*()
generation code is today.
Make it more readable.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-kuabysiykvUJpgus35lhnhvs@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The EVENT_CONSTRAINT_END() macro defines the end marker as
a constraint with a weight of zero. This was all fine
until we blacklisted the corrupting memory events on
Intel IvyBridge. These events are blacklisted by using
a counter bitmask of zero. Thus, they also get a constraint
weight of zero.
The iteration macro: for_each_constraint tests the weight==0.
Therefore, it was stopping at the first blacklisted event, i.e.,
0xd0. The corrupting events were therefore considered as
unconstrained and were scheduled on any of the generic counters.
This patch fixes the end marker to have a weight of -1. With
this, the blacklisted events get an empty constraint and cannot
be scheduled which is what we want for now.
Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Link: http://lkml.kernel.org/r/20131204232437.GA10689@starlight
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since erratum AVR31 in "Intel Atom Processor C2000 Product Family
Specification Update" is now published, I added a justification
comment for disabling IO APIC before Local APIC, as changed in commit:
522e664644 x86/apic: Disable I/O APIC before shutdown of the local APIC
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1386202069-51515-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.
The timer interval is calculated at boot time
based on the power unit used by the HW.
There is one hrtimer per-cpu to handle the case
of multiple simultaneous use across cores on
the same package + hotplug CPU.
Thanks to Maria Dimakopoulou for her contributions
to this patch especially on the math aspects.
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
[ Applied 32-bit build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Cc: maria.n.dimakopoulou@gmail.com
Link: http://lkml.kernel.org/r/1384275531-10892-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event are exposed.
The RAPL counters are available on Intel SandyBridge,
IvyBridge, Haswell. The server skus add a 3rd counter.
The following events are available and exposed in sysfs:
- power/energy-cores: power consumption of all cores on socket
- power/energy-pkg: power consumption of all cores + LLc cache
- power/energy-dram: power consumption of DRAM (servers only)
For each event both the unit (Joules) and scale (2^-32 J)
is exposed in sysfs for use by perf stat and other tools.
The files are:
/sys/devices/power/events/energy-*.unit
/sys/devices/power/events/energy-*.scale
The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/power/cpumask
file can be used by tools to figure out which CPUs to monitor
by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.
All the counters measure in the same unit (exposed via sysfs).
The perf_events API exposes all RAPL counters as 64-bit integers
counting in unit of 1/2^32 Joules (about 0.23 nJ). User level tools
must convert the counts by multiplying them by 2^-32 to obtain
Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point usage. There is no loss of
precision. Thanks to PeterZ for suggesting this approach.
To convert the raw count in Watt:
W = C * 2.3 / (1e10 * time)
or ldexp(C, -32).
RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/device/power/type.
Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/1384275531-10892-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Just a small pile of fixes for bugs and a few regressions. I'm still
trying to track down a driver load hang on my g33 (which infuriatingly
doesn't happen when loading the module manually after boot), somehow
bisecting loves to go astray on this one :( And there's a (harmless)
locking WARN in the suspend code due to one of Jesse's vlv backlight
rework patches. Otherwise nothing outstanding afaik.
* tag 'drm-intel-fixes-2013-11-20' of git://people.freedesktop.org/~danvet/drm-intel:
drm/i915: Fix gen3 self-refresh watermarks
drm/i915: Replicate BIOS eDP bpp clamping hack for hsw
drm/i915: Do not enable package C8 on unsupported hardware
drm/i915: Hold pc8 lock around toggling pc8.gpu_idle
drm/i915: encoder->get_config is no longer optional
drm/i915/tv: add ->get_config callback
drm/i915: restore the early forcewake cleanup
Partially revert "drm/i915: tune the RC6 threshold for stability"
drm/i915: flush cursors harder
i915: Use 120MHz LVDS SSC clock for gen5/gen6/gen7
x86/early quirk: use gen6 stolen detection for VLV
drm/i915/dp: set sink to power down mode on dp disable
Pull x86 fix from Ingo Molnar:
"A modular build fix for certain .config's"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Export 'boot_cpu_physical_apicid' to modules
The only real feature that was added this release is from Namhyung Kim,
who introduced "set_graph_notrace" filter that lets you run the function
graph tracer and not trace particular functions and their call chain.
Tom Zanussi added some updates to the ftrace multibuffer tracing that
made it more consistent with the top level tracing.
One of the fixes for perf function tracing required an API change in
RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
that change in this release too, he gave me a branch that included
all the changes to get that working, and I pulled that into my tree
in order to complete the perf function tracing fix.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSgX5SAAoJEKQekfcNnQGulUAH/jORqJrKaNAulmZ314VsAqfa
zMtF5UAAPf7kqc3AN/jtFrhJUNEfxWOo7A4r0FsM/rKdWJF+98GA6aqYVD+XoWFt
+36fg1enxbXUjixQ96Uh+o1+BJUgYDqljuWzqSu/oiXWfWwl8+WL4kcbhb+V9WcF
SpdzLCWVZRfhyDiN3+0zvyQ8RSG2Pd7CWn9zroI0e4sxGo0Ki6JUnIcXtZGOBDOQ
IIZdjXvGSfpJ+3u3XvRPXJcltRCtOsVWxYzrmvRlmHDW5QMe1+WmmrlojTePrLaJ
xn8+3WINqetAR+ZQnazbpt1XzJzKa8QtFgpiN0kT6qL7cg3N1Owc4vLGohl7wok=
=Nesf
-----END PGP SIGNATURE-----
Merge tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing update from Steven Rostedt:
"This batch of changes is mostly clean ups and small bug fixes. The
only real feature that was added this release is from Namhyung Kim,
who introduced "set_graph_notrace" filter that lets you run the
function graph tracer and not trace particular functions and their
call chain.
Tom Zanussi added some updates to the ftrace multibuffer tracing that
made it more consistent with the top level tracing.
One of the fixes for perf function tracing required an API change in
RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
that change in this release too, he gave me a branch that included all
the changes to get that working, and I pulled that into my tree in
order to complete the perf function tracing fix"
* tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add rcu annotation for syscall trace descriptors
tracing: Do not use signed enums with unsigned long long in fgragh output
tracing: Remove unused function ftrace_off_permanent()
tracing: Do not assign filp->private_data to freed memory
tracing: Add helper function tracing_is_disabled()
tracing: Open tracer when ftrace_dump_on_oops is used
tracing: Add support for SOFT_DISABLE to syscall events
tracing: Make register/unregister_ftrace_command __init
tracing: Update event filters for multibuffer
recordmcount.pl: Add support for __fentry__
ftrace: Have control op function callback only trace when RCU is watching
rcu: Do not trace rcu_is_watching() functions
ftrace/x86: skip over the breakpoint for ftrace caller
trace/trace_stat: use rbtree postorder iteration helper instead of opencoding
ftrace: Add set_graph_notrace filter
ftrace: Narrow down the protected area of graph_lock
ftrace: Introduce struct ftrace_graph_data
ftrace: Get rid of ftrace_graph_filter_enabled
tracing: Fix potential out-of-bounds in trace_get_user()
tracing: Show more exact help information about snapshot
Pull trivial tree updates from Jiri Kosina:
"Usual earth-shaking, news-breaking, rocket science pile from
trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
doc: add missing files to timers/00-INDEX
timekeeping: Fix some trivial typos in comments
mm: Fix some trivial typos in comments
irq: Fix some trivial typos in comments
NUMA: fix typos in Kconfig help text
mm: update 00-INDEX
doc: Documentation/DMA-attributes.txt fix typo
DRM: comment: `halve' -> `half'
Docs: Kconfig: `devlopers' -> `developers'
doc: typo on word accounting in kprobes.c in mutliple architectures
treewide: fix "usefull" typo
treewide: fix "distingush" typo
mm/Kconfig: Grammar s/an/a/
kexec: Typo s/the/then/
Documentation/kvm: Update cpuid documentation for steal time and pv eoi
treewide: Fix common typo in "identify"
__page_to_pfn: Fix typo in comment
Correct some typos for word frequency
clk: fixed-factor: Fix a trivial typo
...