Those two APIs were provided to optimize the calls of
tick_nohz_idle_enter() and rcu_idle_enter() into a single
irq disabled section. This way no interrupt happening in-between would
needlessly process any RCU job.
Now we are talking about an optimization for which benefits
have yet to be measured. Let's start simple and completely decouple
idle rcu and dyntick idle logics to simplify.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is assumed that rcu won't be used once we switch to tickless
mode and until we restart the tick. However this is not always
true, as in x86-64 where we dereference the idle notifiers after
the tick is stopped.
To prepare for fixing this, add two new APIs:
tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu().
If no use of RCU is made in the idle loop between
tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the arch
must instead call the new *_norcu() version such that the arch doesn't
need to call rcu_idle_enter() and rcu_idle_exit().
Otherwise the arch must call tick_nohz_enter_idle() and
tick_nohz_exit_idle() and also call explicitly:
- rcu_idle_enter() after its last use of RCU before the CPU is put
to sleep.
- rcu_idle_exit() before the first use of RCU after the CPU is woken
up.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The tick_nohz_stop_sched_tick() function, which tries to delay
the next timer tick as long as possible, can be called from two
places:
- From the idle loop to start the dytick idle mode
- From interrupt exit if we have interrupted the dyntick
idle mode, so that we reprogram the next tick event in
case the irq changed some internal state that requires this
action.
There are only few minor differences between both that
are handled by that function, driven by the ts->inidle
cpu variable and the inidle parameter. The whole guarantees
that we only update the dyntick mode on irq exit if we actually
interrupted the dyntick idle mode, and that we enter in RCU extended
quiescent state from idle loop entry only.
Split this function into:
- tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
dynticks idle mode unconditionally if it can, and enters into RCU
extended quiescent state.
- tick_nohz_irq_exit() which only updates the dynticks idle mode
when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called).
To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
into tick_nohz_idle_exit().
This simplifies the code and micro-optimize the irq exit path (no need
for local_irq_save there). This also prepares for the split between
dynticks and rcu extended quiescent state logics. We'll need this split to
further fix illegal uses of RCU in extended quiescent states in the idle
loop.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Or else we get lots of variations on this:
arch/mips/pci/pci.c:330: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
scattered throughout the build.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
None of these files are using modular infrastructure, and build
tests reveal that none of these files are really relying on any
implicit inclusions via. module.h either. So delete them.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The address limit is already set in flush_old_exec() via set_fs(USER_DS)
so this assignment is redundant.
[ralf@linux-mips.org: also see dac853ae89 for
further explanation.]
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/2466/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The unwind_stack_by_address variant supports unwinding based
on any kernel code address.
This symbol is also exported so it can be called from modules.
Signed-off-by: Daniel Kalmar <kalmard@homejinni.com>
Signed-off-by: Gergely Kis <gergely@homejinni.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
We never needed that (->regs[2] is overwritten on return from syscall paths
with return value of syscall, so storing it there early made no sense) and
with new restart logics since d27240bf7e61d2656de18e158ec910a902030847 it
has become really bad - we lose the original syscall number before the
place where we decide that we might need a syscall restart.
Note that for child we do need the assignment to regs[2] - it won't go
through the normal return from syscall path.
[Ralf: Issue found and reported by Lluís; initial investigations by me;
bug finally found and patch by Al; testing by me and Lluís.]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Lluís Batlle i Rossell <viriketo@gmail.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Like x86 did in arch/x86/kernel/{process_32.c,process_64.c}, also don't
trace irqsoff for idle.
If there's no useful work to be done, we don't care about the irqsoff
duration. If we trace the idle process, the max duration of irqsoff will
be the idle time and make the irqsoff tracer useless.
Signed-off-by: Wu Zhangjin <wuzhangjin@gmail.com>
Cc: linux-mips@linux-mips.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Patchwork: http://patchwork.linux-mips.org/patch/1044/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
The resume() implementation octeon_switch.S examines the saved cp0_status
register. We were clobbering the entire pt_regs structure in kernel
threads leading to random crashes.
When switching away from a kernel thread, the saved cp0_status is examined
and if bit 30 is set it is cleared and the CP2 state saved into the pt_regs
structure. Since the kernel thread stack overlaid the pt_regs structure
this resulted in a corrupt stack. When the kthread with the corrupt stack
was resumed, it could crash if it used any of the data in the stack that was
clobbered.
We fix it by moving the kernel thread stack down so it doesn't overlay
pt_regs.
Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Each platform has to add support for CPU hotplugging itself by providing
suitable definitions for the cpu_disable and cpu_die of the smp_ops
methods and setting SYS_SUPPORTS_HOTPLUG_CPU. A platform should only set
SYS_SUPPORTS_HOTPLUG_CPU once all it's smp_ops definitions have the
necessary changes. This patch contains the changes to the dummy smp_ops
definition for uni-processor systems.
Parts of the code contributed by Cavium Inc.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This patch also includes the required removal of (unused) inclusion of
<asm/a.out.h> <linux/a.out.h>'s in the arch/ code for these
architectures.
[dwmw2: updated for 2.6.27-rc]
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Jack Ren and Eric Miao tracked down the following long standing
problem in the NOHZ code:
scheduler switch to idle task
enable interrupts
Window starts here
----> interrupt happens (does not set NEED_RESCHED)
irq_exit() stops the tick
----> interrupt happens (does set NEED_RESCHED)
return from schedule()
cpu_idle(): preempt_disable();
Window ends here
The interrupts can happen at any point inside the race window. The
first interrupt stops the tick, the second one causes the scheduler to
rerun and switch away from idle again and we end up with the tick
disabled.
The fact that it needs two interrupts where the first one does not set
NEED_RESCHED and the second one does made the bug obscure and extremly
hard to reproduce and analyse. Kudos to Jack and Eric.
Solution: Limit the NOHZ functionality to the idle loop to make sure
that we can not run into such a situation ever again.
cpu_idle()
{
preempt_disable();
while(1) {
tick_nohz_stop_sched_tick(1); <- tell NOHZ code that we
are in the idle loop
while (!need_resched())
halt();
tick_nohz_restart_sched_tick(); <- disables NOHZ mode
preempt_enable_no_resched();
schedule();
preempt_disable();
}
}
In hindsight we should have done this forever, but ...
/me grabs a large brown paperbag.
Debugged-by: Jack Ren <jack.ren@marvell.com>,
Debugged-by: eric miao <eric.y.miao@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Never terribly functional or popular, plagued by hard to fix bugs the time
to say goodbye has more than arrived.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Match the R4000 semantics for the initial state of interrupt/kernel
status register flags for the R3000 in kernel_thread().
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Convert old/obsolete NORET_TYPE and ATTRIB_NORET macros to use the
newer standard of "__noreturn" as defined in compiler-gcc.h.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This new routine doesn't lookup for symbol names. So we needn't
to pass any char buffers or pointer since we don't care about
names.
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Make sure cpu_has_fpu (which uses smp_processor_id()) is used only in
atomic context.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
If the PC was ret_from_irq or ret_from_exception, there will be no
more normal stackframe. Instead of stopping the unwinding, use PC and
RA saved by an exception handler to continue unwinding into the
interrupted context. This also simplifies the CONFIG_STACKTRACE code.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Implement stacktrace interface by using unwind_stack() and enable lockdep
support in Kconfig.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This array was used to 'cache' some frame info about scheduler
functions to speed up get_wchan(). This array was 1Ko size and
was only used when CONFIG_KALLSYMS was set but declared for all
configs.
Rather than make the array statement conditional, this patches
removes this array and its uses. Indeed the common case doesn't
seem to use this array and get_wchan() is not a critical path
anyways.
It results in a smaller bss and a smaller/cleaner code:
text data bss dec hex filename
2543808 254148 139296 2937252 2cd1a4 vmlinux-new-get-wchan
2544080 254148 143392 2941620 2ce2b4 vmlinux~old
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This patch adds 2 sanity checks.
The first one test that the start address of the function to analyze has been
set by the caller. If not return an error since nothing usefull can be done
without.
The second one checks that the function's size has been set. A null size can
happen if CONFIG_KALLSYMS is not set and it means that we don't know the size
of the function to analyze. In this case, we make it equal to 128 instructions
by default.
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
This patch allows unwind_stack() to return ra for leaf function.
But it tries to detects cases where get_frame_info() wrongly
consider nested function as a leaf one.
It also pass 'unsinged long *sp' instead of 'unsigned long **sp'
as second parameter. The code looks cleaner.
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Now get_frame_info() wants to detect move sp instruction first. It
assumes that the save ra in the stack instruction can't happen
before allocating frame size space into the stack.
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Instead of dump all possible address in the stack, unwind the stack frame
based on prologue code analysis, as like as get_wchan() does. While the
code analysis might fail for some reason, there is a new kernel option
"raw_show_trace" to disable this feature.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>