Commit Graph

124 Commits

Author SHA1 Message Date
Paul E. McKenney
baa1ae0c9f rcu: further lower priority in rcu_yield()
Although rcu_yield() dropped from real-time to normal priority, there
is always the possibility that the competing tasks have been niced.
So nice to 19 in rcu_yield() to help ensure that other tasks have a
better chance of running.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:59 -07:00
Lai Jiangshan
9ab1544eb4 rcu: introduce kfree_rcu()
Many rcu callbacks functions just call kfree() on the base structure.
These functions are trivial, but their size adds up, and furthermore
when they are used in a kernel module, that module must invoke the
high-latency rcu_barrier() function at module-unload time.

The kfree_rcu() function introduced by this commit addresses this issue.
Rather than encoding a function address in the embedded rcu_head
structure, kfree_rcu() instead encodes the offset of the rcu_head
structure within the base structure.  Because the functions are not
allowed in the low-order 4096 bytes of kernel virtual memory, offsets
up to 4095 bytes can be accommodated.  If the offset is larger than
4095 bytes, a compile-time error will be generated in __kfree_rcu().
If this error is triggered, you can either fall back to use of call_rcu()
or rearrange the structure to position the rcu_head structure into the
first 4096 bytes.

Note that the allowable offset might decrease in the future, for example,
to allow something like kmem_cache_free_rcu().

The new kfree_rcu() function can replace code as follows:

	call_rcu(&p->rcu, simple_kfree_callback);

where "simple_kfree_callback()" might be defined as follows:

	void simple_kfree_callback(struct rcu_head *p)
	{
		struct foo *q = container_of(p, struct foo, rcu);

		kfree(q);
	}

with the following:

	kfree_rcu(&p->rcu, rcu);

Note that the "rcu" is the name of a field in the structure being
freed.  The reason for using this rather than passing in a pointer
to the base structure is that the above approach allows better type
checking.

This commit is based on earlier work by Lai Jiangshan and Manfred Spraul:

Lai's V1 patch: http://lkml.org/lkml/2008/9/18/1
Manfred's patch: http://lkml.org/lkml/2009/1/2/115

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:59 -07:00
Paul E. McKenney
6cc68793e3 rcu: fix spelling
The "preemptible" spelling is preferable.  May as well fix it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:59 -07:00
Paul E. McKenney
f0a07aeaf8 rcu: Switch to this_cpu() primitives
This removes a couple of lines from invoke_rcu_cpu_kthread(), improving
readability.

Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:57 -07:00
Paul E. McKenney
5ece5bab3e rcu: Add forward-progress diagnostic for per-CPU kthreads
Increment a per-CPU counter on each pass through rcu_cpu_kthread()'s
service loop, and add it to the rcudata trace output.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:57 -07:00
Paul E. McKenney
15ba0ba860 rcu: add grace-period age and more kthread state to tracing
This commit adds the age in jiffies of the current grace period along
with the duration in jiffies of the longest grace period since boot
to the rcu/rcugp debugfs file.  It also adds an additional "O" state
to kthread tracing to differentiate between the kthread waiting due to
having nothing to do on the one hand and waiting due to being on the
wrong CPU on the other hand.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05 23:16:56 -07:00
Paul E. McKenney
4a29865689 rcu: make rcutorture version numbers available through debugfs
It is not possible to accurately correlate rcutorture output with that
of debugfs.  This patch therefore adds a debugfs file that prints out
the rcutorture version number, permitting easy correlation.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:56 -07:00
Paul E. McKenney
d71df90ead rcu: add tracing for RCU's kthread run states.
Add tracing to help debugging situations when RCU's kthreads are not
running but are supposed to be.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:56 -07:00
Paul E. McKenney
e3995a25fa rcu: put per-CPU kthread at non-RT priority during CPU hotplug operations
If you are doing CPU hotplug operations, it is best not to have
CPU-bound realtime tasks running CPU-bound on the outgoing CPU.
So this commit makes per-CPU kthreads run at non-realtime priority
during that time.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:55 -07:00
Paul E. McKenney
0f962a5e72 rcu: Force per-rcu_node kthreads off of the outgoing CPU
The scheduler has had some heartburn in the past when too many real-time
kthreads were affinitied to the outgoing CPU.  So, this commit lightens
the load by forcing the per-rcu_node and the boost kthreads off of the
outgoing CPU.  Note that RCU's per-CPU kthread remains on the outgoing
CPU until the bitter end, as it must in order to preserve correctness.

Also avoid disabling hardirqs across calls to set_cpus_allowed_ptr(),
given that this function can block.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05 23:16:55 -07:00
Paul E. McKenney
27f4d28057 rcu: priority boosting for TREE_PREEMPT_RCU
Add priority boosting for TREE_PREEMPT_RCU, similar to that for
TINY_PREEMPT_RCU.  This is enabled by the default-off RCU_BOOST
kernel parameter.  The priority to which to boost preempted
RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
(defaulting to real-time priority 1) and the time to wait before
boosting the readers who are blocking a given grace period is
controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
500 milliseconds).

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:55 -07:00
Paul E. McKenney
a26ac2455f rcu: move TREE_RCU from softirq to kthread
If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers.  Otherwise, in presence
of CPU real-time threads, the grace period ends, but the callbacks don't
get invoked.  If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.

But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.

Also add comments and properly synchronized all accesses to
rcu_cpu_kthread_task, as suggested by Lai Jiangshan.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Paul E. McKenney
12f5f524ca rcu: merge TREE_PREEPT_RCU blocked_tasks[] lists
Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the
rcu_node structure into a single ->blkd_tasks list with ->gp_tasks
and ->exp_tasks tail pointers.  This is in preparation for RCU priority
boosting, which will add a third dimension to the combinatorial explosion
in the ->blocked_tasks[] case, but simply a third pointer in the new
->blkd_tasks case.

Also update documentation to reflect blocked_tasks[] merge

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Paul E. McKenney
e59fb3120b rcu: Decrease memory-barrier usage based on semi-formal proof
Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
invocations in rcu_process_callbacks() that are no longer needed, but
sheer paranoia prevented them from being removed.  This commit removes
them and provides a proof of correctness in their absence.  It also adds
a memory barrier to rcu_report_qs_rsp() immediately before the update to
rsp->completed in order to handle the theoretical possibility that the
compiler or CPU might move massive quantities of code into a lock-based
critical section.  This also proves that the sheer paranoia was not
entirely unjustified, at least from a theoretical point of view.

In addition, the old dyntick-idle synchronization depended on the fact
that grace periods were many milliseconds in duration, so that it could
be assumed that no dyntick-idle CPU could reorder a memory reference
across an entire grace period.  Unfortunately for this design, the
addition of expedited grace periods breaks this assumption, which has
the unfortunate side-effect of requiring atomic operations in the
functions that track dyntick-idle state for RCU.  (There is some hope
that the algorithms used in user-level RCU might be applied here, but
some work is required to handle the NMIs that user-space applications
can happily ignore.  For the short term, better safe than sorry.)

This proof assumes that neither compiler nor CPU will allow a lock
acquisition and release to be reordered, as doing so can result in
deadlock.  The proof is as follows:

1.	A given CPU declares a quiescent state under the protection of
	its leaf rcu_node's lock.

2.	If there is more than one level of rcu_node hierarchy, the
	last CPU to declare a quiescent state will also acquire the
	->lock of the next rcu_node up in the hierarchy,  but only
	after releasing the lower level's lock.  The acquisition of this
	lock clearly cannot occur prior to the acquisition of the leaf
	node's lock.

3.	Step 2 repeats until we reach the root rcu_node structure.
	Please note again that only one lock is held at a time through
	this process.  The acquisition of the root rcu_node's ->lock
	must occur after the release of that of the leaf rcu_node.

4.	At this point, we set the ->completed field in the rcu_state
	structure in rcu_report_qs_rsp().  However, if the rcu_node
	hierarchy contains only one rcu_node, then in theory the code
	preceding the quiescent state could leak into the critical
	section.  We therefore precede the update of ->completed with a
	memory barrier.  All CPUs will therefore agree that any updates
	preceding any report of a quiescent state will have happened
	before the update of ->completed.

5.	Regardless of whether a new grace period is needed, rcu_start_gp()
	will propagate the new value of ->completed to all of the leaf
	rcu_node structures, under the protection of each rcu_node's ->lock.
	If a new grace period is needed immediately, this propagation
	will occur in the same critical section that ->completed was
	set in, but courtesy of the memory barrier in #4 above, is still
	seen to follow any pre-quiescent-state activity.

6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
	aware of the end of the old grace period and therefore makes
	any RCU callbacks that were waiting on that grace period eligible
	for invocation.

	If this CPU is the same one that detected the end of the grace
	period, and if there is but a single rcu_node in the hierarchy,
	we will still be in the single critical section.  In this case,
	the memory barrier in step #4 guarantees that all callbacks will
	be seen to execute after each CPU's quiescent state.

	On the other hand, if this is a different CPU, it will acquire
	the leaf rcu_node's ->lock, and will again be serialized after
	each CPU's quiescent state for the old grace period.

On the strength of this proof, this commit therefore removes the memory
barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
The effect is to reduce the number of memory barriers by one and to
reduce the frequency of execution from about once per scheduling tick
per CPU to once per grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Paul E. McKenney
a00e0d714f rcu: Remove conditional compilation for RCU CPU stall warnings
The RCU CPU stall warnings can now be controlled using the
rcu_cpu_stall_suppress boot-time parameter or via the same parameter
from sysfs.  There is therefore no longer any reason to have
kernel config parameters for this feature.  This commit therefore
removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE
kernel config parameters.  The RCU_CPU_STALL_TIMEOUT parameter remains
to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter
remains to allow task-stall information to be suppressed if desired.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Linus Torvalds
72eb6a7914 Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
  gameport: use this_cpu_read instead of lookup
  x86: udelay: Use this_cpu_read to avoid address calculation
  x86: Use this_cpu_inc_return for nmi counter
  x86: Replace uses of current_cpu_data with this_cpu ops
  x86: Use this_cpu_ops to optimize code
  vmstat: User per cpu atomics to avoid interrupt disable / enable
  irq_work: Use per cpu atomics instead of regular atomics
  cpuops: Use cmpxchg for xchg to avoid lock semantics
  x86: this_cpu_cmpxchg and this_cpu_xchg operations
  percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
  percpu,x86: relocate this_cpu_add_return() and friends
  connector: Use this_cpu operations
  xen: Use this_cpu_inc_return
  taskstats: Use this_cpu_ops
  random: Use this_cpu_inc_return
  fs: Use this_cpu_inc_return in buffer.c
  highmem: Use this_cpu_xx_return() operations
  vmstat: Use this_cpu_inc_return for vm statistics
  x86: Support for this_cpu_add, sub, dec, inc_return
  percpu: Generic support for this_cpu_add, sub, dec, inc_return
  ...

Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
2011-01-07 17:02:58 -08:00
Paul E. McKenney
b52573d279 rcu: reduce __call_rcu()-induced contention on rcu_node structures
When the current __call_rcu() function was written, the expedited
APIs did not exist.  The __call_rcu() implementation therefore went
to great lengths to detect the end of old grace periods and to start
new ones, all in the name of reducing grace-period latency.  Now the
expedited APIs do exist, and the usage of __call_rcu() has increased
considerably.  This commit therefore causes __call_rcu() to avoid
worrying about grace periods unless there are a large number of
RCU callbacks stacked up on the current CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-12-17 12:34:49 -08:00
Paul E. McKenney
0209f6490b rcu: limit rcu_node leaf-level fanout
Some recent benchmarks have indicated possible lock contention on the
leaf-level rcu_node locks.  This commit therefore limits the number of
CPUs per leaf-level rcu_node structure to 16, in other words, there
can be at most 16 rcu_data structures fanning into a given rcu_node
structure.  Prior to this, the limit was 32 on 32-bit systems and 64 on
64-bit systems.

Note that the fanout of non-leaf rcu_node structures is unchanged.  The
organization of accesses to the rcu_node tree is such that references
to non-leaf rcu_node structures are much less frequent than to the
leaf structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-12-17 12:34:20 -08:00
Paul E. McKenney
121dfc4b3e rcu: fine-tune grace-period begin/end checks
Use the CPU's bit in rnp->qsmask to determine whether or not the CPU
should try to report a quiescent state.  Handle overflow in the check
for rdp->gpnum having fallen behind.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-12-17 12:34:20 -08:00
Frederic Weisbecker
5ff8e6f053 rcu: Keep gpnum and completed fields synchronized
When a CPU that was in an extended quiescent state wakes
up and catches up with grace periods that remote CPUs
completed on its behalf, we update the completed field
but not the gpnum that keeps a stale value of a backward
grace period ID.

Later, note_new_gpnum() will interpret the shift between
the local CPU and the node grace period ID as some new grace
period to handle and will then start to hunt quiescent state.

But if every grace periods have already been completed, this
interpretation becomes broken. And we'll be stuck in clusters
of spurious softirqs because rcu_report_qs_rdp() will make
this broken state run into infinite loop.

The solution, as suggested by Lai Jiangshan, is to ensure that
the gpnum and completed fields are well synchronized when we catch
up with completed grace periods on their behalf by other cpus.
This way we won't start noting spurious new grace periods.

Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-12-17 12:34:19 -08:00
Frederic Weisbecker
20377f32dc rcu: Stop chasing QS if another CPU did it for us
When a CPU is idle and others CPUs handled its extended
quiescent state to complete grace periods on its behalf,
it will catch up with completed grace periods numbers
when it wakes up.

But at this point there might be no more grace period to
complete, but still the woken CPU always keeps its stale
qs_pending value and will then continue to chase quiescent
states even if its not needed anymore.

This results in clusters of spurious softirqs until a new
real grace period is started. Because if we continue to
chase quiescent states but we have completed every grace
periods, rcu_report_qs_rdp() is puzzled and makes that
state run into infinite loops.

As suggested by Lai Jiangshan, just reset qs_pending if
someone completed every grace periods on our behalf.

Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-12-17 12:34:18 -08:00
Christoph Lameter
909ea96468 core: Replace __get_cpu_var with __this_cpu_read if not used for an address.
__get_cpu_var() can be replaced with this_cpu_read and will then use a
single read instruction with implied address calculation to access the
correct per cpu instance.

However, the address of a per cpu variable passed to __this_cpu_read()
cannot be determined (since it's an implied address conversion through
segment prefixes).  Therefore apply this only to uses of __get_cpu_var
where the address of the variable is not used.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:07:19 +01:00
Paul E. McKenney
2d999e03b7 rcu: update documentation/comments for Lai's adoption patch
Lai's RCU-callback immediate-adoption patch changes the RCU tracing
output, so update tracing.txt.  Also update a few comments to clarify
the synchronization design.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:59 -08:00
Lai Jiangshan
29494be71a rcu,cleanup: simplify the code when cpu is dying
When we handle the CPU_DYING notifier, the whole system is stopped except
for the current CPU.  We therefore need no synchronization with the other
CPUs.  This allows us to move any orphaned RCU callbacks directly to the
list of any online CPU without needing to run them through the global
orphan lists.  These global orphan lists can therefore be dispensed with.
This commit makes thes changes, though currently victimizes CPU 0 @@@.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29 22:01:58 -08:00
Dongdong Deng
4ee0a60392 rcu: using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value
Using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value
due to the caller didn't hold the root_rcu/rnp node's lock.  Although
use without ACCESS_ONCE() is safe due to the value loaded being used
but once, the ACCESS_ONCE() is a good documentation aid -- the variables
are being loaded without the services of a lock.

Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
CC: Dipankar Sarma <dipankar@in.ibm.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-10-07 10:41:06 -07:00
Paul E. McKenney
269dcc1c2e rcu: Add tracing data to support queueing models
The current tracing data is not sufficient to deduce the average time
that a callback spends waiting for a grace period to end.  Add three
per-CPU counters recording the number of callbacks invoked (ci), the
number of callbacks orphaned (co), and the number of callbacks adopted
(ca).  Given the existing callback queue length (ql), the average wait
time in absence of CPU hotplug operations is ql/ci.  The units of wait
time will be in terms of the duration over which ci was measured.

In the presence of CPU hotplug operations, there is room for argument,
but ql/(ci-co+ca) won't steer you too far wrong.

Also fixes a typo called out by Lucas De Marchi <lucas.de.marchi@gmail.com>.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-09-23 09:16:53 -07:00
Paul E. McKenney
53d84e004d rcu: permit suppressing current grace period's CPU stall warnings
When using a kernel debugger, a long sojourn in the debugger can get
you lots of RCU CPU stall warnings once you resume.  This might not be
helpful, especially if you are using the system console.  This patch
therefore allows RCU CPU stall warnings to be suppressed, but only for
the duration of the current set of grace periods.

This differs from Jason's original patch in that it adds support for
tiny RCU and preemptible RCU, and uses a slightly different method for
suppressing the RCU CPU stall warning messages.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Jason Wessel <jason.wessel@windriver.com>
2010-08-20 09:00:12 -07:00
Paul E. McKenney
8cdd32a918 rcu: refer RCU CPU stall-warning victims to stallwarn.txt
There is some documentation on RCU CPU stall warnings contained in
Documentation/RCU/stallwarn.txt, but it will not be apparent to someone
who runs into such a warning while under time pressure.  This commit
therefore adds comments preceding the printk()s pointing out the
location of this documentation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-08-20 09:00:11 -07:00
Paul E. McKenney
910b1b7e19 rcu: Allow RCU CPU stall warnings to be off at boot, but manually enablable
Currently, if RCU CPU stall warnings are enabled, they are enabled
immediately upon boot.  They can be manually disabled via /sys (and
also re-enabled via /sys), and are automatically disabled upon panic.
However, some users need RCU CPU stalls to be disabled at boot time,
but to be enabled without rebuilding/rebooting.  For example, someone
running a real-time application in production might not want the
additional latency of RCU CPU stall detection in normal operation, but
might need to enable it at any point for fault isolation purposes.

This commit therefore provides a new CONFIG_RCU_CPU_STALL_DETECTOR_RUNNABLE
kernel configuration parameter that maintains the current behavior
(enable at boot) by default, but allows a kernel to be configured
with RCU CPU stall detection built into the kernel, but disabled at
boot time.

Requested-by: Clark Williams <williams@redhat.com>
Requested-by: John Kacur <jkacur@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-08-19 17:18:04 -07:00
Paul E. McKenney
f2e0dd7090 rcu: allow RCU CPU stall warning messages to be controlled in /sys
Set the permissions of the rcu_cpu_stall_suppress to 644 to enable RCU
CPU stall warnings to be enabled and disabled at runtime via sysfs.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-08-19 17:18:03 -07:00
Paul E. McKenney
742734eea0 rcu: add boot parameter to suppress RCU CPU stall warning messages
Although the RCU CPU stall warning messages are a very good way to alert
people to a problem, once alerted, it is sometimes helpful to shut them
off in order to avoid obscuring other messages that might be being used
to track down the problem.  Although you can rebuild the kernel with
CONFIG_RCU_CPU_STALL_DETECTOR=n, this is sometimes inconvenient.  This
commit therefore adds a boot parameter named "rcu_cpu_stall_suppress"
that shuts these messages off without requiring a rebuild (though a
reboot might be needed for those not brave enough to patch their kernel
while it is running).

This message-suppression was already in place for the panic case, so this
commit need only rename the variable and export it via module_param().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-08-19 17:18:02 -07:00
Lai Jiangshan
394f99a900 rcu: simplify the usage of percpu data
&percpu_data is compatible with allocated percpu data.

And we use it and remove the "->rda[NR_CPUS]" array, saving significant
storage on systems with large numbers of CPUs.  This does add an additional
level of indirection and thus an additional cache line referenced, but
because ->rda is not used on the read side, this is OK.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2010-08-19 17:18:01 -07:00
Mathieu Desnoyers
551d55a944 tree/tiny rcu: Add debug RCU head objects
Helps finding racy users of call_rcu(), which results in hangs because list
entries are overwritten and/or skipped.

Changelog since v4:
- Bissectability is now OK
- Now generate a WARN_ON_ONCE() for non-initialized rcu_head passed to
  call_rcu(). Statically initialized objects are detected with
  object_is_static().
- Rename rcu_head_init_on_stack to init_rcu_head_on_stack.
- Remove init_rcu_head() completely.

Changelog since v3:
- Include comments from Lai Jiangshan

This new patch version is based on the debugobjects with the newly introduced
"active state" tracker.

Non-initialized entries are all considered as "statically initialized". An
activation fixup (triggered by call_rcu()) takes care of performing the debug
object initialization without issuing any warning. Since we cannot increase the
size of struct rcu_head, I don't see much room to put an identifier for
statically initialized rcu_head structures. So for now, we have to live without
"activation without explicit init" detection. But the main purpose of this debug
option is to detect double-activations (double call_rcu() use of a rcu_head
before the callback is executed), which is correctly addressed here.

This also detects potential internal RCU callback corruption, which would cause
the callbacks to be executed twice.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: David S. Miller <davem@davemloft.net>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: akpm@linux-foundation.org
CC: mingo@elte.hu
CC: laijs@cn.fujitsu.com
CC: dipankar@in.ibm.com
CC: josh@joshtriplett.org
CC: dvhltc@us.ibm.com
CC: niv@us.ibm.com
CC: tglx@linutronix.de
CC: peterz@infradead.org
CC: rostedt@goodmis.org
CC: Valdis.Kletnieks@vt.edu
CC: dhowells@redhat.com
CC: eric.dumazet@gmail.com
CC: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2010-06-14 16:37:26 -07:00
Paul E. McKenney
72d5a9f7a9 rcu: remove all rcu head initializations, except on_stack initializations
Remove all rcu head inits. We don't care about the RCU head state before passing
it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can
keep track of objects on stack.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-11 16:10:47 -07:00
Paul E. McKenney
d21670acab rcu: reduce the number of spurious RCU_SOFTIRQ invocations
Lai Jiangshan noted that up to 10% of the RCU_SOFTIRQ are spurious, and
traced this down to the fact that the current grace-period machinery
will uselessly raise RCU_SOFTIRQ when a given CPU needs to go through
a quiescent state, but has not yet done so.  In this situation, there
might well be nothing that RCU_SOFTIRQ can do, and the overhead can be
worth worrying about in the ksoftirqd case.  This patch therefore avoids
raising RCU_SOFTIRQ in this situation.

Changes since v1 (http://lkml.org/lkml/2010/3/30/122 from Lai Jiangshan):

o	Omit the rcu_qs_pending() prechecks, as they aren't that
	much less expensive than the quiescent-state checks.

o	Merge with the set_need_resched() patch that reduces IPIs.

o	Add the new n_rp_report_qs field to the rcu_pending tracing output.

o	Update the tracing documentation accordingly.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:35 -07:00
Paul E. McKenney
4a90a0681c rcu: permit discontiguous cpu_possible_mask CPU numbering
TREE_RCU assumes that CPU numbering is contiguous, but some users need
large holes in the numbering to better map to hardware layout.  This patch
makes TREE_RCU (and TREE_PREEMPT_RCU) tolerate large holes in the CPU
numbering.  However, NR_CPUS must still be greater than the largest
CPU number.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:35 -07:00
Paul E. McKenney
4300aa642c rcu: improve RCU CPU stall-warning messages
The existing RCU CPU stall-warning messages can be confusing, especially
in the case where one CPU detects a single other stalled CPU.  In addition,
the console messages did not say which flavor of RCU detected the stall,
which can make it difficult to work out exactly what is causing the stall.
This commit improves these messages.

Requested-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
26845c2860 rcu: print boot-time console messages if RCU configs out of ordinary
Print boot-time messages if tracing is enabled, if fanout is set
to non-default values, if exact fanout is specified, if accelerated
dyntick-idle grace periods have been enabled, if RCU-lockdep is enabled,
if rcutorture has been boot-time enabled, if the CPU stall detector has
been disabled, or if four-level hierarchy has been enabled.

This is all for TREE_RCU and TREE_PREEMPT_RCU.  TINY_RCU will be handled
separately, if at all.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
c68de2097a rcu: disable CPU stall warnings upon panic
The current RCU CPU stall warnings remain enabled even after a panic
occurs, which some people have found to be a bit counterproductive.
This patch therefore uses a notifier to disable stall warnings once a
panic occurs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
bbad937983 rcu: slim down rcutiny by removing rcu_scheduler_active and friends
TINY_RCU does not need rcu_scheduler_active unless CONFIG_DEBUG_LOCK_ALLOC.
So conditionally compile rcu_scheduler_active in order to slim down
rcutiny a bit more.  Also gets rid of an EXPORT_SYMBOL_GPL, which is
responsible for most of the slimming.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
25502a6c13 rcu: refactor RCU's context-switch handling
The addition of preemptible RCU to treercu resulted in a bit of
confusion and inefficiency surrounding the handling of context switches
for RCU-sched and for RCU-preempt.  For RCU-sched, a context switch
is a quiescent state, pure and simple, just like it always has been.
For RCU-preempt, a context switch is in no way a quiescent state, but
special handling is required when a task blocks in an RCU read-side
critical section.

However, the callout from the scheduler and the outer loop in ksoftirqd
still calls something named rcu_sched_qs(), whose name is no longer
accurate.  Furthermore, when rcu_check_callbacks() notes an RCU-sched
quiescent state, it ends up unnecessarily (though harmlessly, aside
from the performance hit) enqueuing the current task if it happens to
be running in an RCU-preempt read-side critical section.  This not only
increases the maximum latency of scheduler_tick(), it also needlessly
increases the overhead of the next outermost rcu_read_unlock() invocation.

This patch addresses this situation by separating the notion of RCU's
context-switch handling from that of RCU-sched's quiescent states.
The context-switch handling is covered by rcu_note_context_switch() in
general and by rcu_preempt_note_context_switch() for preemptible RCU.
This permits rcu_sched_qs() to handle quiescent states and only quiescent
states.  It also reduces the maximum latency of scheduler_tick(), though
probably by much less than a microsecond.  Finally, it means that tasks
within preemptible-RCU read-side critical sections avoid incurring the
overhead of queuing unless there really is a context switch.

Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2010-05-10 11:08:33 -07:00
Lai Jiangshan
0c34029abd rcu: move some code from macro to function
Shrink the RCU_INIT_FLAVOR() macro by moving all but the initialization
of the ->rda[] array to rcu_init_one().  The call to rcu_init_one()
can then be moved to the end of the RCU_INIT_FLAVOR() macro, which is
required because rcu_boot_init_percpu_data(), which is now called from
rcu_init_one(), depends on the initialization of the ->rda[] array.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
Lai Jiangshan
f261414f0d rcu: make dead code really dead
cleanup: make dead code really dead

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
Paul E. McKenney
d25eb9442b rcu: substitute set_need_resched for sending resched IPIs
This patch adds a check to __rcu_pending() that does a local
set_need_resched() if the current CPU is holding up the current grace
period and if force_quiescent_state() will be called soon.  The goal is
to reduce the probability that force_quiescent_state() will need to do
smp_send_reschedule(), which sends an IPI and is therefore more expensive
on most architectures.

Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
Paul E. McKenney
a47cd880b5 rcu: Fix accelerated grace periods for last non-dynticked CPU
It is invalid to invoke __rcu_process_callbacks() with irqs
disabled, so do it indirectly via raise_softirq().  This
requires a state-machine implementation to cycle through the
grace-period machinery the required number of times.

Located-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267231138-27856-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-27 09:53:52 +01:00
Paul E. McKenney
d9f1bb6ad7 rcu: Make rcu_read_lock_sched_held() take boot time into account
Before the scheduler starts, all tasks are non-preemptible by
definition. So, during that time, rcu_read_lock_sched_held()
needs to always return "true".  This patch makes that be so.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267135607-7056-2-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 08:20:46 +01:00
Paul E. McKenney
1ed509a225 rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
When RCU detects a grace-period stall, it currently just prints
out the PID of any tasks doing the stalling.  This patch adds
RCU_CPU_STALL_VERBOSE, which enables the more-verbose reporting
from sched_show_task().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-21-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:35:02 +01:00
Paul E. McKenney
3acd9eb31c rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
Under TREE_PREEMPT_RCU, print_other_cpu_stall() invokes
rcu_print_task_stall() with the root rcu_node structure's ->lock
held, and rcu_print_task_stall() acquires that same lock for
self-deadlock. Fix this by removing the lock acquisition from
rcu_print_task_stall(), and making all callers acquire the lock
instead.

Tested-by: John Kacur <jkacur@redhat.com>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Located-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-19-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:59 +01:00
Paul E. McKenney
1304afb225 rcu: Convert to raw_spinlocks
The spinlocks in rcutree need to be real spinlocks in
preempt-rt. Convert them to raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-18-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:58 +01:00
Paul E. McKenney
20133cfce7 rcu: Stop overflowing signed integers
The C standard does not specify the result of an operation that
overflows a signed integer, so such operations need to be
avoided.  This patch changes the type of several fields from
"long" to "unsigned long" and adjusts operations as needed.
ULONG_CMP_GE() and ULONG_CMP_LT() macros are introduced to do
the modular comparisons that are appropriate given that overflow
is an expected event.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-17-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:57 +01:00