Fix interrupt emulation code in kretprobe-booster according to
pt_regs update (es/ds change and gs adding).
This issue has been reported on systemtap-bugzilla:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=9965
| On a -tip kernel on x86_32, kretprobe_example (from samples) triggers the
| following backtrace when its retprobing a class of functions that cause a
| copy_from/to_user().
|
| BUG: sleeping function called from invalid context at mm/memory.c:3196
| in_atomic(): 0, irqs_disabled(): 1, pid: 2286, name: cat
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: systemtap-ml <systemtap@sources.redhat.com>
LKML-Reference: <49C7995C.2010601@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When I review the sensitive code ftrace_nmi_enter(), I found
the atomic variable nmi_running does protect NMI VS do_ftrace_mod_code(),
but it can not protects NMI(entered nmi) VS NMI(ftrace_nmi_enter()).
cpu#1 | cpu#2 | cpu#3
ftrace_nmi_enter() | do_ftrace_mod_code() |
not modify | |
------------------------|-----------------------|--
executing | set mod_code_write = 1|
executing --|-----------------------|--------------------
executing | | ftrace_nmi_enter()
executing | | do modify
------------------------|-----------------------|-----------------
ftrace_nmi_exit() | |
cpu#3 may be being modified the code which is still being executed on cpu#1,
it will have undefined results and possibly take a GPF, this patch
prevents it occurred.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <49C0B411.30003@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Don't boost at the addresses which are listed on exception tables,
because major page fault will occur on those addresses. In that case,
kprobes can not ensure that when instruction buffer can be freed since
some processes will sleep on the buffer.
kprobes-ia64 already has same check.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order for ntpd to correctly synchronize the clocks, the frequency of
the system clock must not be off by more than 500 ppm (or, put another
way, 1:2000), or ntpd will end up giving up on trying to synchronize
properly, and ends up reseting the clock in jumps instead.
The fast TSC PIT calibration sometimes failed this test - it was
assuming that the PIT reads always took about one microsecond each (2us
for the two reads to get a 16-bit timer), and that calibrating TSC to
the PIT over 15ms should thus be sufficient to get much closer than
500ppm (max 2us error on both sides giving 4us over 15ms: a 270 ppm
error value).
However, that assumption does not always hold: apparently some hardware
is either very much slower at reading the PIT registers, or there was
other noise causing at least one machine to get 700+ ppm errors.
So instead of using a fixed 15ms timing loop, this changes the fast PIT
calibration to read the TSC delta over the individual PIT timer reads,
and use the result to calculate the error bars on the PIT read timing
properly. We then successfully calibrate the TSC only if the maximum
error bars fall below 500ppm.
In the process, we also relax the timing to allow up to 25ms for the
calibration, although it can happen much faster depending on hardware.
Reported-and-tested-by: Jesper Krogh <jesper@krogh.cc>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During bootup, when we reprogram the PIT (programmable interval timer)
to start counting down from 0xffff in order to use it for the fast TSC
calibration, we should also make sure to delay a bit afterwards to allow
the PIT hardware to actually start counting with the new value.
That will happens at the next CLK pulse (1.193182 MHz), so the easiest
way to do that is to just wait at least one microsecond after
programming the new PIT counter value. We do that by just reading the
counter value back once - which will take about 2us on PC hardware.
Reported-and-tested-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix:
arch/x86/kernel/entry_32.S:446: Warning: 00000000080001d1 shortened to 00000000000001d1
arch/x86/kernel/entry_32.S:457: Warning: 000000000800feff shortened to 000000000000feff
arch/x86/kernel/entry_32.S:527: Warning: 00000000080001d1 shortened to 00000000000001d1
arch/x86/kernel/entry_32.S:541: Warning: 000000000800feff shortened to 000000000000feff
arch/x86/kernel/entry_32.S:676: Warning: 0000000008000091 shortened to 0000000000000091
TIF_SYSCALL_FTRACE is 0x08000000 and until now we checked the
first 16 bits of the work mask - bit 27 falls outside of that.
Update the entry_32.S code to check the full 32-bit mask.
[ %cx => %ecx fix from Cyrill Gorcunov <gorcunov@gmail.com> ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "H. Peter Anvin" <hpa@kernel.org>
LKML-Reference: <1237012693.18733.3.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
kernel/built-in.o: In function `ftrace_syscall_exit':
(.text+0x76667): undefined reference to `syscall_nr_to_meta'
ftrace.o is built:
obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
But now a CONFIG_FTRACE_SYSCALLS dependency is needed too.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide the x86 trace callbacks to trace syscalls.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: mark save_paranoid as non-kprobe-able code
This appears to be necessary as the function gets called from
kprobes-unsafe exception handling stubs (i.e. which themselves
live in .kprobes.text).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B8F44F.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
These got left in needlessly when ret_from_fork got simplified.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B8F355.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move store_ldt outside the CONFIG_PARAVIRT section and
also clean up the code a bit.
Signed-off-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: saving power _very_ little
round_jiffies() round up absolute jiffies to full second.
round_jiffies_relative() round up relative jiffies to full second.
The "t->expires" is absolute jiffies. Then, round_jiffies() should be
used instead round_jiffies_relative().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: New major feature
This patch add kexec jump support for x86_64. More information about
kexec jump can be found in corresponding x86_32 support patch.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Fix corner case that cannot yet occur
image->start may be outside of 0 ~ max_pfn, for example when jumping
back to original kernel from kexeced kenrel. This patch add identity
map for pages at image->start.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup
Fix some coding style issue for kexec x86.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Introduce:
cat /sys/kernel/debug/x86/cpu/*
for Intel and AMD processors to view / debug the state of each CPU.
By using this we can debug whole range of registers and other
cpu information for debugging purpose and monitor how things
are changing.
This can be useful for developers as well as for users.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1236701373.3387.4.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e088e4c9cd.
Removing the sysfs interface for p4-clockmod was flagged as a
regression in bug 12826.
Course of action:
- Find out the remaining causes of overheating, and fix them
if possible. ACPI should be doing the right thing automatically.
If it isn't, we need to fix that.
- mark p4-clockmod ui as deprecated
- try again with the removal in six months.
It's not really feasible to printk about the deprecation, because
it needs to happen at all the sysfs entry points, which means adding
a lot of strcmp("p4-clockmod".. calls to the core, which.. bleuch.
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: cleanup and code size reduction on 64-bit
This code is only applied to Intel Pentium and AMD K7 32-bit cpus.
Move those checks to intel_init()/amd_init() for 32-bit
so 64-bit will not build this code.
Also change to use cpu_index check to see if we need to emit warning.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B377D2.8030108@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In uv_flush_tlb_others() (arch/x86/kernel/tlb_uv.c),
the "WARN_ON(!in_atomic())" fails if CONFIG_PREEMPT is not enabled.
And CONFIG_PREEMPT is not enabled by default in the distribution that
most UV owners will use.
We could #ifdef CONFIG_PREEMPT the warning, but that is not good form.
And there seems to be no suitable fix to in_atomic() when CONFIG_PREMPT
is not on.
As Ingo commented:
> and we have no proper primitive to test for atomicity. (mainly
> because we dont know about atomicity on a non-preempt kernel)
So we drop the WARN_ON.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use fixmaps instead of vmap/vunmap in text_poke() for avoiding
page allocation and delayed unmapping.
At the result of above change, text_poke() becomes atomic and can be called
from stop_machine() etc.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
LKML-Reference: <49B14352.2040705@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the mutual exclusion provided by the text edit lock in alternatives code.
Since alternative_smp_* will be called from module init code, etc,
we'd better protect it from other subsystems.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <49B14332.9030109@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ds_write_config() can write the BTS as well as the PEBS part of
the DS config. ds_request_pebs() passes the wrong qualifier, which
results in the wrong configuration to be written.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090305085721.A22550@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In case a ptraced task is reaped (while the tracer is still attached),
ds_exit_thread() is called before ptrace_exit(). The latter will
release the bts_tracer and remove the thread's ds_ctx.
The former will WARN() if the context is not NULL.
Oleg Nesterov submitted patches that move ptrace_exit() before
exit_thread() and thus reverse the order of the above calls.
Remove the bad warning. I will add it again when Oleg's changes are in.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090305084954.A22000@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix relocation overflow during module load
x86_64 uses 32bit relocations for symbol access and static percpu
symbols whether in core or modules must be inside 2GB of the percpu
segement base which the dynamic percpu allocator doesn't guarantee.
This patch makes x86_64 reserve PERCPU_MODULE_RESERVE bytes in the
first chunk so that module percpu areas are always allocated from the
first chunk which is always inside the relocatable range.
This problem exists for any percpu allocator but is easily triggered
when using the embedding allocator because the second chunk is located
beyond 2GB on it.
This patch also changes the meaning of PERCPU_DYNAMIC_RESERVE such
that it only indicates the size of the area to reserve for dynamic
allocation as static and dynamic areas can be separate. New
PERCPU_DYNAMIC_RESERVED is increased by 4k for both 32 and 64bits as
the reserved area separation eats away some allocatable space and
having slightly more headroom (currently between 4 and 8k after
minimal boot sans module area) makes sense for common case
performance.
x86_32 can address anywhere from anywhere and doesn't need reserving.
Mike Galbraith first reported the problem first and bisected it to the
embedding percpu allocator commit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Impact: add reserved allocation functionality and use it for module
percpu variables
This patch implements reserved allocation from the first chunk. When
setting up the first chunk, arch can ask to set aside certain number
of bytes right after the core static area which is available only
through a separate reserved allocator. This will be used primarily
for module static percpu variables on architectures with limited
relocation range to ensure that the module perpcu symbols are inside
the relocatable range.
If reserved area is requested, the first chunk becomes reserved and
isn't available for regular allocation. If the first chunk also
includes piggy-back dynamic allocation area, a separate chunk mapping
the same region is created to serve dynamic allocation. The first one
is called static first chunk and the second dynamic first chunk.
Although they share the page map, their different area map
initializations guarantee they serve disjoint areas according to their
purposes.
If arch doesn't setup reserved area, reserved allocation is handled
like any other allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: reduce unnecessary memory usage on certain configurations
Embedding percpu allocator allocates unit_size *
smp_num_possible_cpus() bytes consecutively and use it for the first
chunk. However, if the static area is small, this can result in
excessive prellocated free space in the first chunk due to
PCPU_MIN_UNIT_SIZE restriction.
This patch makes embedding percpu allocator preallocate only what's
necessary as described by PERPCU_DYNAMIC_RESERVE and return the
leftover to the bootmem allocator.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: argument semantic cleanup
In pcpu_setup_first_chunk(), zero @unit_size and @dyn_size meant
auto-sizing. It's okay for @unit_size as 0 doesn't make sense but 0
dynamic reserve size is valid. Alos, if arch @dyn_size is calculated
from other parameters, it might end up passing in 0 @dyn_size and
malfunction when the size is automatically adjusted.
This patch makes both @unit_size and @dyn_size ssize_t and use -1 for
auto sizing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup, micro-optimization
Pre-initialize boot_cpu_data.x86_phys_bits to a reasonable default
to remove the use of system_state tests in __virt_addr_valid()
and __phys_addr().
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: decrease hangs risks with the graph tracer on slow systems
Since the function graph tracer can spend too much time on timer
interrupts, it's better now to use the more lightweight local
clock. Anyway, the function graph traces are more reliable on a
per cpu trace.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af243d.06e9300a.53ad.ffff840c@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dell XPS710 will hang on reboot. This is resolved by adding a quirk to
set bios reboot.
Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Cc: "manoj.iyer" <manoj.iyer@canonical.com>
Cc: <stable@kernel.org>
LKML-Reference: <1236196380.3231.89.camel@emiko>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save a bit of RAM
Get the exact size for the reserve_bootmem() call.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49AE4922.605@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>