linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2025-01-21 13:19:00 +07:00

Author	SHA1	Message	Date
Li Zefan	3aaba20f26	tracing: Fix a race in function profile While we are reading trace_stat/functionX and someone just disabled function_profile at that time, we can trigger this: divide error: 0000 [#1] PREEMPT SMP ... EIP is at function_stat_show+0x90/0x230 ... This fix just takes the ftrace_profile_lock and checks if rec->counter is 0. If it's 0, we know the profile buffer has been reset. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: stable@kernel.org LKML-Reference: <4C723644.4040708@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-31 16:46:23 -04:00
Tejun Heo	9c37547ab6	workqueue: use zalloc_cpumask_var() for gcwq->mayday_mask alloc_mayday_mask() was using alloc_cpumask_var() making gcwq->mayday_mask contain garbage after initialization on CONFIG_CPUMASK_OFFSTACK=y configurations. This combined with the previously fixed GCWQ_DISASSOCIATED initialization bug could make rescuers fall into infinite loop trying to bind to an offline cpu. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: CAI Qian <caiqian@redhat.com>	2010-08-31 11:18:34 +02:00
Tejun Heo	477a3c33d1	workqueue: fix GCWQ_DISASSOCIATED initialization init_workqueues() incorrectly marks workqueues for all possible CPUs associated. Combined with mayday_mask initialization bug, this can make rescuers keep trying to bind to an offline gcwq indefinitely. Fix init_workqueues() such that only online CPUs have their gcwqs have GCWQ_DISASSOCIATED cleared. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: CAI Qian <caiqian@redhat.com>	2010-08-31 10:54:35 +02:00
Ingo Molnar	daab7fc734	Merge commit 'v2.6.36-rc3' into x86/memblock Conflicts: arch/x86/kernel/trampoline.c mm/memblock.c Merge reason: Resolve the conflicts, update to latest upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-31 09:45:46 +02:00
Stephane Eranian	fa66f07aa1	perf_events: Fix time tracking for events with pid != -1 and cpu != -1 Per-thread events with a cpu filter, i.e., cpu != -1, were not reporting correct timings when the thread never ran on the monitored cpu. The time enabled was reported as a negative value. This patch fixes the problem by updating tstamp_stopped, tstamp_running in event_sched_out() for events with filters and which are marked as INACTIVE. The function group_sched_out() is modified to systematically call into event_sched_out() to avoid duplicating the timing adjustment code twice. With the patch, I now get: $ task_cpu -i -e unhalted_core_cycles,unhalted_core_cycles noploop 2 noploop for 2 seconds CPU0 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU0 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU1 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU1 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU2 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU2 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594) CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594) Signed-off-by: Stephane Eranian <eranian@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: davem@davemloft.net Cc: fweisbec@gmail.com Cc: perfmon2-devel@lists.sf.net Cc: eranian@google.com LKML-Reference: <4c76802d.aae9d80a.115d.70fe@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-30 12:16:55 +02:00
Linus Torvalds	6f4dbeca1a	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM QoS: Fix inline documentation. PM QoS: Fix kzalloc() parameters swapped in pm_qos_power_open()	2010-08-28 14:06:19 -07:00
Linus Torvalds	2637d139fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: pxa27x_keypad - remove input_free_device() in pxa27x_keypad_remove() Input: mousedev - fix regression of inverting axes Input: uinput - add devname alias to allow module on-demand load Input: hil_kbd - fix compile error USB: drop tty argument from usb_serial_handle_sysrq_char() Input: sysrq - drop tty argument form handle_sysrq() Input: sysrq - drop tty argument from sysrq ops handlers	2010-08-28 13:55:31 -07:00
Yinghai Lu	a587d2daeb	x86: Remove not used early_res code and some functions in e820.c that are not used anymore Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-08-27 11:13:51 -07:00
Paul E. McKenney	dd7c4d8973	rcu: performance fixes to TINY_PREEMPT_RCU callback checking This commit tightens up checks in rcu_preempt_check_callbacks() to avoid unnecessary special handling at rcu_read_unlock() time. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-27 10:51:17 -07:00
Frederic Weisbecker	98ee74a75c	Merge branch 'perf/urgent' into perf/core Conflicts: tools/perf/util/callchain.h Merge reason: Fix a non-trivial conflict with latest fixes	2010-08-27 02:30:07 +02:00
Saravana Kannan	25cc69ec34	PM QoS: Fix inline documentation. Fix the pm_qos_add_request() kerneldoc comment that doesn't reflect the behavior of the function after the last PM QoS update. Signed-off-by: Saravana Kannan <skannan@codeaurora.org> Acked-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-08-26 20:18:43 +02:00
Linus Torvalds	d4348c6789	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, x86, Pentium4: Clear the P4_CCCR_FORCE_OVF flag tracing/trace_stack: Fix stack trace on ppc64	2010-08-25 10:50:07 -07:00
Linus Torvalds	5e686019df	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states sched: Fix rq->clock synchronization when migrating tasks	2010-08-25 08:40:56 -07:00
Ingo Molnar	7de5d895b2	Merge branch 'linus' into perf/core Merge reason: pick up perf fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-25 13:10:00 +02:00
Anton Blanchard	151772dbfa	tracing/trace_stack: Fix stack trace on ppc64 save_stack_trace() stores the instruction pointer, not the function descriptor. On ppc64 the trace stack code currently dereferences the instruction pointer and shows 8 bytes of instructions in our backtraces: # cat /sys/kernel/debug/tracing/stack_trace Depth Size Location (26 entries) ----- ---- -------- 0) 5424 112 0x6000000048000004 1) 5312 160 0x60000000ebad01b0 2) 5152 160 0x2c23000041c20030 3) 4992 240 0x600000007c781b79 4) 4752 160 0xe84100284800000c 5) 4592 192 0x600000002fa30000 6) 4400 256 0x7f1800347b7407e0 7) 4144 208 0xe89f0108f87f0070 8) 3936 272 0xe84100282fa30000 Since we aren't dealing with function descriptors, use %pS instead of %pF to fix it: # cat /sys/kernel/debug/tracing/stack_trace Depth Size Location (26 entries) ----- ---- -------- 0) 5424 112 ftrace_call+0x4/0x8 1) 5312 160 .current_io_context+0x28/0x74 2) 5152 160 .get_io_context+0x48/0xa0 3) 4992 240 .cfq_set_request+0x94/0x4c4 4) 4752 160 .elv_set_request+0x60/0x84 5) 4592 192 .get_request+0x2d4/0x468 6) 4400 256 .get_request_wait+0x7c/0x258 7) 4144 208 .__make_request+0x49c/0x610 8) 3936 272 .generic_make_request+0x390/0x434 Signed-off-by: Anton Blanchard <anton@samba.org> Cc: rostedt@goodmis.org Cc: fweisbec@gmail.com LKML-Reference: <20100825013238.GE28360@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-25 13:08:48 +02:00
Tejun Heo	8a2e8e5dec	workqueue: fix cwq->nr_active underflow cwq->nr_active is used to keep track of how many work items are active for the cpu workqueue, where 'active' is defined as either pending on global worklist or executing. This is used to implement the max_active limit and workqueue freezing. If a work item is queued after nr_active has already reached max_active, the work item doesn't increment nr_active and is put on the delayed queue and gets activated later as previous active work items retire. try_to_grab_pending() which is used in the cancellation path unconditionally decremented nr_active whether the work item being cancelled is currently active or delayed, so cancelling a delayed work item makes nr_active underflow. This breaks max_active enforcement and triggers BUG_ON() in destroy_workqueue() later on. This patch fixes this bug by adding a flag WORK_STRUCT_DELAYED, which is set while a work item in on the delayed list and making try_to_grab_pending() decrement nr_active iff the work item is currently active. The addition of the flag enlarges cwq alignment to 256 bytes which is getting a bit too large. It's scheduled to be reduced back to 128 bytes by merging WORK_STRUCT_PENDING and WORK_STRUCT_CWQ in the next devel cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Johannes Berg <johannes@sipsolutions.net>	2010-08-25 10:33:56 +02:00
Linus Torvalds	502adf5778	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: watchdog: Don't throttle the watchdog tracing: Fix timer tracing	2010-08-24 12:21:49 -07:00
Linus Torvalds	3b6c5507a6	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mutex: Improve the scalability of optimistic spinning	2010-08-24 12:21:02 -07:00
David Alan Gilbert	bac1e74dba	PM QoS: Fix kzalloc() parameters swapped in pm_qos_power_open() sparse spotted that the kzalloc() in pm_qos_power_open() in the current Linus' git tree had its parameters swapped. Fix this. Signed-off-by: David Alan Gilbert <linux@treblig.org> Acked-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-08-24 20:22:18 +02:00
Tejun Heo	e41e704bc4	workqueue: improve destroy_workqueue() debuggability Now that the worklist is global, having works pending after wq destruction can easily lead to oops and destroy_workqueue() have several BUG_ON()s to catch these cases. Unfortunately, BUG_ON() doesn't tell much about how the work became pending after the final flush_workqueue(). This patch adds WQ_DYING which is set before the final flush begins. If a work is requested to be queued on a dying workqueue, WARN_ON_ONCE() is triggered and the request is ignored. This clearly indicates which caller is trying to queue a work on a dying workqueue and keeps the system working in most cases. Locking rule comment is updated such that the 'I' rule includes modifying the field from destruction path. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-08-24 18:01:32 +02:00
Namhyung Kim	972fa1c531	workqueue: mark lock acquisition on worker_maybe_bind_and_lock() worker_maybe_bind_and_lock() actually grabs gcwq->lock but was missing proper annotation. Add it. So this patch will remove following sparse warnings: kernel/workqueue.c:1214:13: warning: context imbalance in 'worker_maybe_bind_and_lock' - wrong count at exit arch/x86/include/asm/irqflags.h:44:9: warning: context imbalance in 'worker_rebind_fn' - unexpected unlock kernel/workqueue.c:1991:17: warning: context imbalance in 'rescuer_thread' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-08-23 11:37:49 +02:00
Namhyung Kim	06bd6ebffa	workqueue: annotate lock context change Some of internal functions called within gcwq->lock context releases and regrabs the lock but were missing proper annotations. Add it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-08-23 11:37:49 +02:00
Ingo Molnar	a6b9b4d50f	Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu	2010-08-23 11:32:34 +02:00
Tim Chen	9d0f4dcc5c	mutex: Improve the scalability of optimistic spinning There is a scalability issue for current implementation of optimistic mutex spin in the kernel. It is found on a 8 node 64 core Nehalem-EX system (HT mode). The intention of the optimistic mutex spin is to busy wait and spin on a mutex if the owner of the mutex is running, in the hope that the mutex will be released soon and be acquired, without the thread trying to acquire mutex going to sleep. However, when we have a large number of threads, contending for the mutex, we could have the mutex grabbed by other thread, and then another ……, and we will keep spinning, wasting cpu cycles and adding to the contention. One possible fix is to quit spinning and put the current thread on wait-list if mutex lock switch to a new owner while we spin, indicating heavy contention (see the patch included). I did some testing on a 8 socket Nehalem-EX system with a total of 64 cores. Using Ingo's test-mutex program that creates/delete files with 256 threads (http://lkml.org/lkml/2006/1/8/50) , I see the following speed up after putting in the mutex spin fix: ./mutex-test V 256 10 Ops/sec 2.6.34 62864 With fix 197200 Repeating the test with Aim7 fserver workload, again there is a speed up with the fix: Jobs/min 2.6.34 91657 With fix 149325 To look at the impact on the distribution of mutex acquisition time, I collected the mutex acquisition time on Aim7 fserver workload with some instrumentation. The average acquisition time is reduced by 48% and number of contentions reduced by 32%. #contentions Time to acquire mutex (cycles) 2.6.34 72973 44765791 With fix 49210 23067129 The histogram of mutex acquisition time is listed below. The acquisition time is in 2^bin cycles. We see that without the fix, the acquisition time is mostly around 2^26 cycles. With the fix, we the distribution get spread out a lot more towards the lower cycles, starting from 2^13. However, there is an increase of the tail distribution with the fix at 2^28 and 2^29 cycles. It seems a small price to pay for the reduced average acquisition time and also getting the cpu to do useful work. Mutex acquisition time distribution (acq time = 2^bin cycles): 2.6.34 With Fix bin #occurrence % #occurrence % 11 2 0.00% 120 0.24% 12 10 0.01% 790 1.61% 13 14 0.02% 2058 4.18% 14 86 0.12% 3378 6.86% 15 393 0.54% 4831 9.82% 16 710 0.97% 4893 9.94% 17 815 1.12% 4667 9.48% 18 790 1.08% 5147 10.46% 19 580 0.80% 6250 12.70% 20 429 0.59% 6870 13.96% 21 311 0.43% 1809 3.68% 22 255 0.35% 2305 4.68% 23 317 0.44% 916 1.86% 24 610 0.84% 233 0.47% 25 3128 4.29% 95 0.19% 26 63902 87.69% 122 0.25% 27 619 0.85% 286 0.58% 28 0 0.00% 3536 7.19% 29 0 0.00% 903 1.83% 30 0 0.00% 0 0.00% I've done similar experiments with 2.6.35 kernel on smaller boxes as well. One is on a dual-socket Westmere box (12 cores total, with HT). Another experiment is on an old dual-socket Core 2 box (4 cores total, no HT) On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test program with my mutex patch but no significant difference in aim7's fserver workload. On the 4-core Core 2 box, I see the difference with the patch for both mutex-test and aim7 fserver are negligible. So far, it seems like the patch has not caused regression on smaller systems. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # .35.x LKML-Reference: <1282168827.9542.72.camel@schen9-DESK> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-23 10:56:27 +02:00
Peter Zijlstra	c6db67cda7	watchdog: Don't throttle the watchdog Stephane reported that when the machine locks up, the regular ticks, which are responsible to resetting the throttle count, stop too. Hence the NMI watchdog can end up being throttled before it reports on the locked up state, and we end up being sad.. Cure this by having the watchdog overflow reset its own throttle count. Reported-by: Stephane Eranian <eranian@google.com> Tested-by: Stephane Eranian <eranian@google.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282215916.1926.4696.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-23 10:48:05 +02:00
Arjan van de Ven	e36c886a0f	workqueue: Add basic tracepoints to track workqueue execution With the introduction of the new unified work queue thread pools, we lost one feature: It's no longer possible to know which worker is causing the CPU to wake out of idle. The result is that PowerTOP now reports a lot of "kworker/a:b" instead of more readable results. This patch adds a pair of tracepoints to the new workqueue code, similar in style to the timer/hrtimer tracepoints. With this pair of tracepoints, the next PowerTOP can correctly report which work item caused the wakeup (and how long it took): Interrupt (43) i915 time 3.51ms wakeups 141 Work ieee80211_iface_work time 0.81ms wakeups 29 Work do_dbs_timer time 0.55ms wakeups 24 Process Xorg time 21.36ms wakeups 4 Timer sched_rt_period_timer time 0.01ms wakeups 1 Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-21 13:19:37 -07:00
Linus Torvalds	297c5eee37	mm: make the vma list be doubly linked It's a really simple list, and several of the users want to go backwards in it to find the previous vma. So rather than have to look up the previous entry with 'find_vma_prev()' or something similar, just make it doubly linked instead. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-21 08:49:21 -07:00
Dmitry Torokhov	f335397d17	Input: sysrq - drop tty argument form handle_sysrq() Sysrq operations do not accept tty argument anymore so no need to pass it to us. [Stephen Rothwell <sfr@canb.auug.org.au>: fix build breakage in drm code caused by sysrq using bool but not including linux/types.h] [Sachin Sant <sachinp@in.ibm.com>: fix build breakage in s390 keyboadr driver] Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2010-08-21 00:34:45 -07:00
Andrea Righi	b35de43b31	kfifo: implement missing __kfifo_skip_r() kfifo_skip() is currently broken, due to the missing of the internal helper function. Add it. Signed-off-by: Andrea Righi <arighi@develer.com> Cc: Greg KH <greg@kroah.com> Acked-by: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-20 09:34:54 -07:00
Paul E. McKenney	80dcf60e6b	rcu: apply TINY_PREEMPT_RCU read-side speedup to TREE_PREEMPT_RCU Replace one of the ACCESS_ONCE() calls in each of __rcu_read_lock() and __rcu_read_unlock() with barrier() as suggested by Steve Rostedt in order to avoid the potential compiler-optimization-induced bug noted by Mathieu Desnoyers. Located-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:17 -07:00
Paul E. McKenney	7b0b759b65	rcu: combine duplicate code, courtesy of CONFIG_PREEMPT_RCU The CONFIG_PREEMPT_RCU kernel configuration parameter was recently re-introduced, but as an indication of the type of RCU (preemptible vs. non-preemptible) instead of as selecting a given implementation. This commit uses CONFIG_PREEMPT_RCU to combine duplicate code from include/linux/rcutiny.h and include/linux/rcutree.h into include/linux/rcupdate.h. This commit also combines a few other pieces of duplicate code that have accumulated. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:16 -07:00
Paul E. McKenney	a3dc3fb161	rcu: repair code-duplication FIXMEs Combine the duplicate definitions of ULONG_CMP_GE(), ULONG_CMP_LT(), and rcu_preempt_depth() into include/linux/rcupdate.h. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:13 -07:00
Paul E. McKenney	53d84e004d	rcu: permit suppressing current grace period's CPU stall warnings When using a kernel debugger, a long sojourn in the debugger can get you lots of RCU CPU stall warnings once you resume. This might not be helpful, especially if you are using the system console. This patch therefore allows RCU CPU stall warnings to be suppressed, but only for the duration of the current set of grace periods. This differs from Jason's original patch in that it adds support for tiny RCU and preemptible RCU, and uses a slightly different method for suppressing the RCU CPU stall warning messages. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-20 09:00:12 -07:00
Paul E. McKenney	8cdd32a918	rcu: refer RCU CPU stall-warning victims to stallwarn.txt There is some documentation on RCU CPU stall warnings contained in Documentation/RCU/stallwarn.txt, but it will not be apparent to someone who runs into such a warning while under time pressure. This commit therefore adds comments preceding the printk()s pointing out the location of this documentation. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:11 -07:00
Paul E. McKenney	a57eb940d1	rcu: Add a TINY_PREEMPT_RCU Implement a small-memory-footprint uniprocessor-only implementation of preemptible RCU. This implementation uses but a single blocked-tasks list rather than the combinatorial number used per leaf rcu_node by TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies processing. This version also takes advantage of uniprocessor execution to accelerate grace periods in the case where there are no readers. The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU. This implementation is a step towards having RCU implementation driven off of the SMP and PREEMPT kernel configuration variables, which can happen once this implementation has accumulated sufficient experience. Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as suggested by Steve Rostedt in order to avoid the compiler-reordering issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183). As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbyte savings compared to CONFIG_TREE_PREEMPT_RCU. Of course, for non-real-time workloads, CONFIG_TINY_RCU is even better. CONFIG_TREE_PREEMPT_RCU text data bss dec filename 13 0 0 13 kernel/rcupdate.o 6170 825 28 7023 kernel/rcutree.o ---- 7026 Total CONFIG_TINY_PREEMPT_RCU text data bss dec filename 13 0 0 13 kernel/rcupdate.o 2081 81 8 2170 kernel/rcutiny.o ---- 2183 Total CONFIG_TINY_RCU (non-preemptible) text data bss dec filename 13 0 0 13 kernel/rcupdate.o 719 25 0 744 kernel/rcutiny.o --- 757 Total Requested-by: Loïc Minier <loic.minier@canonical.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 08:55:00 -07:00
Peter Zijlstra	861d034ee8	sched: Fix rq->clock synchronization when migrating tasks sched_fork() -- we do task placement in ->task_fork_fair() ensure we update_rq_clock() so we work with current time. We leave the vruntime in relative state, so the time delay until wake_up_new_task() doesn't matter. wake_up_new_task() -- Since task_fork_fair() left p->vruntime in relative state we can safely migrate, the activate_task() on the remote rq will call update_rq_clock() and causes the clock to be synced (enough). Tested-by: Jack Daniel <wanders.thirst@gmail.com> Tested-by: Philby John <pjohn@mvista.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1281002322.1923.1708.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-20 14:59:01 +02:00
Lin Ming	277b199800	lockup_detector: Make callback function static watchdog_overflow_callback() is only used in kernel/watchdog.c. Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Don Zickus <dzickus@redhat.com> LKML-Reference: <1282273431.16443.32.camel@minggr.sh.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-20 10:09:41 +02:00
Dmitry Torokhov	1495cc9df4	Input: sysrq - drop tty argument from sysrq ops handlers Noone is using tty argument so let's get rid of it. Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2010-08-19 22:07:06 -07:00
Paul E. McKenney	910b1b7e19	rcu: Allow RCU CPU stall warnings to be off at boot, but manually enablable Currently, if RCU CPU stall warnings are enabled, they are enabled immediately upon boot. They can be manually disabled via /sys (and also re-enabled via /sys), and are automatically disabled upon panic. However, some users need RCU CPU stalls to be disabled at boot time, but to be enabled without rebuilding/rebooting. For example, someone running a real-time application in production might not want the additional latency of RCU CPU stall detection in normal operation, but might need to enable it at any point for fault isolation purposes. This commit therefore provides a new CONFIG_RCU_CPU_STALL_DETECTOR_RUNNABLE kernel configuration parameter that maintains the current behavior (enable at boot) by default, but allows a kernel to be configured with RCU CPU stall detection built into the kernel, but disabled at boot time. Requested-by: Clark Williams <williams@redhat.com> Requested-by: John Kacur <jkacur@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-19 17:18:04 -07:00
Paul E. McKenney	f2e0dd7090	rcu: allow RCU CPU stall warning messages to be controlled in /sys Set the permissions of the rcu_cpu_stall_suppress to 644 to enable RCU CPU stall warnings to be enabled and disabled at runtime via sysfs. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-19 17:18:03 -07:00
Paul E. McKenney	77d8485a8b	rcu: improve kerneldoc for rcu_read_lock(), call_rcu(), and synchronize_rcu() Make it explicit that new RCU read-side critical sections that start after call_rcu() and synchronize_rcu() start might still be running after the end of the relevant grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:18:02 -07:00
Paul E. McKenney	742734eea0	rcu: add boot parameter to suppress RCU CPU stall warning messages Although the RCU CPU stall warning messages are a very good way to alert people to a problem, once alerted, it is sometimes helpful to shut them off in order to avoid obscuring other messages that might be being used to track down the problem. Although you can rebuild the kernel with CONFIG_RCU_CPU_STALL_DETECTOR=n, this is sometimes inconvenient. This commit therefore adds a boot parameter named "rcu_cpu_stall_suppress" that shuts these messages off without requiring a rebuild (though a reboot might be needed for those not brave enough to patch their kernel while it is running). This message-suppression was already in place for the panic case, so this commit need only rename the variable and export it via module_param(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-19 17:18:02 -07:00
Paul E. McKenney	b163760e37	rcu: make CPU stall warning timeout configurable Also set the default to 60 seconds, up from the previous hard-coded timeout of 10 seconds. This allows people who care to set short timeouts, while avoiding people with unusual configurations (make randconfig!!!) from being bothered with spurious CPU stall warnings. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:18:02 -07:00
Tetsuo Handa	4221a9918e	Add RCU check for find_task_by_vpid(). find_task_by_vpid() says "Must be called under rcu_read_lock().". But due to commit `3120438` "rcu: Disable lockdep checking in RCU list-traversal primitives", we are currently unable to catch "find_task_by_vpid() with tasklist_lock held but RCU lock not held" errors due to the RCU-lockdep checks being suppressed in the RCU variants of the struct list_head traversals. This commit therefore places an explicit check for being in an RCU read-side critical section in find_task_by_pid_ns(). =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- kernel/pid.c:386 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 1 lock held by rc.sysinit/1102: #0: (tasklist_lock){.+.+..}, at: [<c1048340>] sys_setpgid+0x40/0x160 stack backtrace: Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1 Call Trace: [<c105e714>] lockdep_rcu_dereference+0x94/0xb0 [<c104b4cd>] find_task_by_pid_ns+0x6d/0x70 [<c104b4e8>] find_task_by_vpid+0x18/0x20 [<c1048347>] sys_setpgid+0x47/0x160 [<c1002b50>] sysenter_do_call+0x12/0x36 Commit updated to use a new rcu_lockdep_assert() exported API rather than the old internal __do_rcu_dereference(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:18:02 -07:00
Lai Jiangshan	394f99a900	rcu: simplify the usage of percpu data &percpu_data is compatible with allocated percpu data. And we use it and remove the "->rda[NR_CPUS]" array, saving significant storage on systems with large numbers of CPUs. This does add an additional level of indirection and thus an additional cache line referenced, but because ->rda is not used on the read side, this is OK. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:18:01 -07:00
Lai Jiangshan	e546f485e1	rcutorture: add random preemption Add random preemption to help we to torture the preemptable rcu. srcu_read_delay() also calls rcu_read_delay() for shorter delays. Added comment to preempt_schedule() call indicating that no quiescent states happen if preemption is disabled. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:18:01 -07:00
Arnd Bergmann	2c392b8c34	cgroups: __rcu annotations Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:18:00 -07:00
Arnd Bergmann	67bdbffd69	rculist: avoid __rcu annotations This avoids warnings from missing __rcu annotations in the rculist implementation, making it possible to use the same lists in both RCU and non-RCU cases. We can add rculist annotations later, together with lockdep support for rculist, which is missing as well, but that may involve changing all the users. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:18:00 -07:00
Paul E. McKenney	ca5ecddfa8	rcu: define __rcu address space modifier for sparse This commit provides definitions for the __rcu annotation defined earlier. This annotation permits sparse to check for correct use of RCU-protected pointers. If a pointer that is annotated with __rcu is accessed directly (as opposed to via rcu_dereference(), rcu_assign_pointer(), or one of their variants), sparse can be made to complain. To enable such complaints, use the new default-disabled CONFIG_SPARSE_RCU_POINTER kernel configuration option. Please note that these sparse complaints are intended to be a debugging aid, -not- a code-style-enforcement mechanism. There are special rcu_dereference_protected() and rcu_access_pointer() accessors for use when RCU read-side protection is not required, for example, when no other CPU has access to the data structure in question or while the current CPU hold the update-side lock. This patch also updates a number of docbook comments that were showing their age. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Christopher Li <sparse@chrisli.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2010-08-19 17:17:59 -07:00
Ingo Molnar	c8710ad389	Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-08-19 12:48:09 +02:00
Namhyung Kim	6016ee13db	perf, tracing: add missing __percpu markups ftrace_event_call->perf_events, perf_trace_buf, fgraph_data->cpu_data and some local variables are percpu pointers missing __percpu markups. Add them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> LKML-Reference: <1281498479-28551-1-git-send-email-namhyung@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-08-19 01:33:05 +02:00
Frederic Weisbecker	7ae07ea3a4	perf: Humanize the number of contexts Instead of hardcoding the number of contexts for the recursions barriers, define a cpp constant to make the code more self-explanatory. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com>	2010-08-19 01:32:53 +02:00
Frederic Weisbecker	927c7a9e92	perf: Fix race in callchains Now that software events don't have interrupt disabled anymore in the event path, callchains can nest on any context. So seperating nmi and others contexts in two buffers has become racy. Fix this by providing one buffer per nesting level. Given the size of the callchain entries (2040 bytes * 4), we now need to allocate them dynamically. v2: Fixed put_callchain_entry call after recursion. Fix the type of the recursion, it must be an array. v3: Use a manual pr cpu allocation (temporary solution until NMIs can safely access vmalloc'ed memory). Do a better separation between callchain reference tracking and allocation. Make the "put" path lockless for non-release cases. v4: Protect the callchain buffers with rcu. v5: Do the cpu buffers allocations node affine. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Will Deacon <will.deacon@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David Miller <davem@davemloft.net> Cc: Borislav Petkov <bp@amd64.org>	2010-08-19 01:32:31 +02:00
Frederic Weisbecker	f72c1a931e	perf: Factorize callchain context handling Store the kernel and user contexts from the generic layer instead of archs, this gathers some repetitive code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mackerras <paulus@samba.org> Tested-by: Will Deacon <will.deacon@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: David Miller <davem@davemloft.net> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Borislav Petkov <bp@amd64.org>	2010-08-19 01:32:11 +02:00
Frederic Weisbecker	56962b4449	perf: Generalize some arch callchain code - Most archs use one callchain buffer per cpu, except x86 that needs to deal with NMIs. Provide a default perf_callchain_buffer() implementation that x86 overrides. - Centralize all the kernel/user regs handling and invoke new arch handlers from there: perf_callchain_user() / perf_callchain_kernel() That avoid all the user_mode(), current->mm checks and so... - Invert some parameters in perf_callchain_*() helpers: entry to the left, regs to the right, following the traditional (dst, src). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mackerras <paulus@samba.org> Tested-by: Will Deacon <will.deacon@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: David Miller <davem@davemloft.net> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Borislav Petkov <bp@amd64.org>	2010-08-19 01:30:59 +02:00
Linus Torvalds	145c3ae46b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: brlock vfsmount_lock fs: scale files_lock lglock: introduce special lglock and brlock spin locks tty: fix fu_list abuse fs: cleanup files_lock locking fs: remove extra lookup in __lookup_hash fs: fs_struct rwlock to spinlock apparmor: use task path helpers fs: dentry allocation consolidation fs: fix do_lookup false negative mbcache: Limit the maximum number of cache entries hostfs ->follow_link() braino hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy remove SWRITE* I/O types kill BH_Ordered flag vfs: update ctime when changing the file's permission by setfacl cramfs: only unlock new inodes fix reiserfs_evict_inode end_writeback second call	2010-08-18 09:35:08 -07:00
Linus Torvalds	1ca72feb93	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Fix build on POSIX shells latencytop: Fix kconfig dependency warnings perf annotate tui: Fix exit and RIGHT keys handling tracing: Sanitize value returned from write(trace_marker, "...", len) tracing/events: Convert format output to seq_file tracing: Extend recordmcount to better support Blackfin mcount tracing: Fix ring_buffer_read_page reading out of page boundary tracing: Fix an unallocated memory access in function_graph	2010-08-18 09:32:13 -07:00
Li Zefan	86397dc3cc	tracing: Clean up seqfile code for format file Remove the nasty hack that marks a pointer's LSB to distinguish common fields from event fields. Replace it with a more sane approach. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4C6A23C2.9020606@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-18 11:09:14 -04:00
Nick Piggin	2a4419b5b2	fs: fs_struct rwlock to spinlock fs: fs_struct rwlock to spinlock struct fs_struct.lock is an rwlock with the read-side used to protect root and pwd members while taking references to them. Taking a reference to a path typically requires just 2 atomic ops, so the critical section is very small. Parallel read-side operations would have cacheline contention on the lock, the dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a real parallelism increase. Replace it with a spinlock to avoid one or two atomic operations in typical path lookup fastpath. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-18 08:35:46 -04:00
Linus Torvalds	392abeea52	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: vt,console,kdb: preserve console_blanked while in kdb vt: fix regression warnings from KMS merge arm,kgdb: fix GDB_MAX_REGS no longer used kgdb: add missing __percpu markup in arch/x86/kernel/kgdb.c kdb: fix compile error without CONFIG_KALLSYMS	2010-08-17 18:36:19 -07:00
Daniel J Blueman	f362b73244	Fix unprotected access to task credentials in waitid() Using a program like the following: #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> int main() { id_t id; siginfo_t infop; pid_t res; id = fork(); if (id == 0) { sleep(1); exit(0); } kill(id, SIGSTOP); alarm(1); waitid(P_PID, id, &infop, WCONTINUED); return 0; } to call waitid() on a stopped process results in access to the child task's credentials without the RCU read lock being held - which may be replaced in the meantime - eliciting the following warning: =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- kernel/exit.c:1460 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 2 locks held by waitid02/22252: #0: (tasklist_lock){.?.?..}, at: [<ffffffff81061ce5>] do_wait+0xc5/0x310 #1: (&(&sighand->siglock)->rlock){-.-...}, at: [<ffffffff810611da>] wait_consider_task+0x19a/0xbe0 stack backtrace: Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3 Call Trace: [<ffffffff81095da4>] lockdep_rcu_dereference+0xa4/0xc0 [<ffffffff81061b31>] wait_consider_task+0xaf1/0xbe0 [<ffffffff81061d15>] do_wait+0xf5/0x310 [<ffffffff810620b6>] sys_waitid+0x86/0x1f0 [<ffffffff8105fce0>] ? child_wait_callback+0x0/0x70 [<ffffffff81003282>] system_call_fastpath+0x16/0x1b This is fixed by holding the RCU read lock in wait_task_continued() to ensure that the task's current credentials aren't destroyed between us reading the cred pointer and us reading the UID from those credentials. Furthermore, protect wait_task_stopped() in the same way. We don't need to keep holding the RCU read lock once we've read the UID from the credentials as holding the RCU read lock doesn't stop the target task from changing its creds under us - so the credentials may be outdated immediately after we've read the pointer, lock or no lock. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-17 18:07:43 -07:00
David Howells	d7627467b7	Make do_execve() take a const filename pointer Make do_execve() take a const filename pointer so that kernel_execve() compiles correctly on ARM: arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type This also requires the argv and envp arguments to be consted twice, once for the pointer array and once for the strings the array points to. This is because do_execve() passes a pointer to the filename (now const) to copy_strings_kernel(). A simpler alternative would be to cast the filename pointer in do_execve() when it's passed to copy_strings_kernel(). do_execve() may not change any of the strings it is passed as part of the argv or envp lists as they are some of them in .rodata, so marking these strings as const should be fine. Further kernel_execve() and sys_execve() need to be changed to match. This has been test built on x86_64, frv, arm and mips. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-17 18:07:43 -07:00
John Kacur	6a103b0d44	lockup detector: Fix grammar by adding a missing "to" in the comments This fixes a minor grammar problem in the comments in hung_task.c Signed-off-by: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1281021054-4228-2-git-send-email-jkacur@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-17 09:11:52 +02:00
John Kacur	f1b499f029	lockdep: Remove __debug_show_held_locks There is no longer any functional difference between __debug_show_held_locks() and debug_show_held_locks(), so remove the former. Signed-off-by: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1281021054-4228-1-git-send-email-jkacur@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-17 09:11:10 +02:00
Jason Wessel	b590cddfa6	kdb: fix compile error without CONFIG_KALLSYMS If CONFIG_KGDB_KDB is set and CONFIG_KALLSYMS is not set the kernel will fail to build with the error: kernel/built-in.o: In function `kallsyms_symbol_next': kernel/debug/kdb/kdb_support.c:237: undefined reference to `kdb_walk_kallsyms' kernel/built-in.o: In function `kallsyms_symbol_complete': kernel/debug/kdb/kdb_support.c:193: undefined reference to `kdb_walk_kallsyms' The kdb_walk_kallsyms needs a #ifdef proper header to match the C implementation. This patch also fixes the compiler warnings in kdb_support.c when compiling without CONFIG_KALLSYMS set. The compiler warnings are a result of the kallsyms_lookup() macro not initializing the two of the pass by reference variables. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Reported-by: Michal Simek <monstr@monstr.eu>	2010-08-16 15:58:29 -05:00
Steven Rostedt	d244b6bd41	Merge branch 'tip/perf/urgent-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into trace/tip/perf/urgent-4 Conflicts: kernel/trace/trace_events.c Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-16 11:17:30 -04:00
Xiaotian Feng	8d9df9f084	workqueue: free rescuer on destroy_workqueue wq->rescuer is not freed when wq is destroyed, leads a memory leak then. This patch also remove a redundant line. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>	2010-08-16 09:55:01 +02:00
Marcin Slusarz	1aa54bca6e	tracing: Sanitize value returned from write(trace_marker, "...", len) When userspace code writes non-new-line-terminated string to trace_marker file, write handler appends new-line and returns number of bytes written to trace buffer, so write(fd, "abc", 3) will return 4 That's unexpected and unfortunately it confuses glibc's fprintf function. Example: int main() { fprintf(stderr, "abc"); return 0; } $ gcc test.c -o test $ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer $ ./test 2>/sys/kernel/debug/tracing/trace_marker results in infinite loop: write(fd, "abc", 3) = 4 write(fd, "", 1) = 0 write(fd, "", 1) = 0 write(fd, "", 1) = 0 write(fd, "", 1) = 0 write(fd, "", 1) = 0 write(fd, "", 1) = 0 write(fd, "", 1) = 0 (...) ...and kernel trace buffer full of empty markers. Fix it by sanitizing write return value. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> LKML-Reference: <20100727231801.GB2826@joi.lan> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-13 15:23:16 -04:00
John Stultz	c7dcf87a68	time: Workaround gcc loop optimization that causes 64bit div errors Early 4.3 versions of gcc apparently aggressively optimize the raw time accumulation loop, replacing it with a divide. On 32bit systems, this causes the following link errors: undefined reference to `__umoddi3' undefined reference to `__udivdi3' The gcc issue has been fixed in 4.4 and greater. This patch replaces the accumulation loop with a do_div, as suggested by Linus. Signed-off-by: John Stultz <johnstul@us.ibm.com> CC: Jason Wessel <jason.wessel@windriver.com> CC: Larry Finger <Larry.Finger@lwfinger.net> CC: Ingo Molnar <mingo@elte.hu> CC: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-13 12:03:24 -07:00
Linus Torvalds	2069601b3f	Revert "fsnotify: store struct file not struct path" This reverts commit `3bcf3860a4` (and the accompanying commit `c1e5c95402` "vfs/fsnotify: fsnotify_close can delay the final work in fput" that was a horribly ugly hack to make it work at all). The 'struct file' approach not only causes that disgusting hack, it somehow breaks pulseaudio, probably due to some other subtlety with f_count handling. Fix up various conflicts due to later fsnotify work. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-12 14:23:04 -07:00
Steven Rostedt	2a37a3df57	tracing/events: Convert format output to seq_file Two new events were added that broke the current format output. Both from the SCSI system: scsi_dispatch_cmd_done and scsi_dispatch_cmd_timeout The reason is that their print_fmt exceeded a page size. Since the output of the format used simple_read_from_buffer and trace_seq, it was limited to a page size in output. This patch converts the printing of the format of an event into seq_file, which allows greater than a page size to be shown. I diffed all event formats comparing the output with and without this patch. All matched except for the above two, which showed just: FORMAT TOO BIG without this patch, but now properly displays the output with this patch. v2: Remove updating *pos in seq start function. [ Thanks to Li Zefan for pointing that out ] Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com> Cc: James Bottomley <James.Bottomley@suse.de> Cc: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com> Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-12 16:59:29 -04:00
Linus Torvalds	26df0766a7	Merge branch 'params' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'params' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (22 commits) param: don't deref arg in __same_type() checks param: update drivers/acpi/debug.c to new scheme param: use module_param in drivers/message/fusion/mptbase.c ide: use module_param_named rather than module_param_call param: update drivers/char/ipmi/ipmi_watchdog.c to new scheme param: lock if_sdio's lbs_helper_name and lbs_fw_name against sysfs changes. param: lock myri10ge_fw_name against sysfs changes. param: simple locking for sysfs-writable charp parameters param: remove unnecessary writable charp param: add kerneldoc to moduleparam.h param: locking for kernel parameters param: make param sections const. param: use free hook for charp (fix leak of charp parameters) param: add a free hook to kernel_param_ops. param: silence .init.text references from param ops Add param ops struct for hvc_iucv driver. nfs: update for module_param_named API change AppArmor: update for module_param_named API change param: use ops in struct kernel_param, rather than get and set fns directly param: move the EXPORT_SYMBOL to after the definitions. ...	2010-08-12 10:01:59 -07:00
Jason Wessel	deda2e8196	timekeeping: Fix overflow in rawtime tv_nsec on 32 bit archs The tv_nsec is a long and when added to the shifted interval it can wrap and become negative which later causes looping problems in the getrawmonotonic(). The edge case occurs when the system has slept for a short period of time of ~2 seconds. A trace printk of the values in this patch illustrate the problem: ftrace time stamp: log 43.716079: logarithmic_accumulation: raw: 3d0913 tv_nsec d687faa 43.718513: logarithmic_accumulation: raw: 3d0913 tv_nsec da588bd 43.722161: logarithmic_accumulation: raw: 3d0913 tv_nsec de291d0 46.349925: logarithmic_accumulation: raw: 7a122600 tv_nsec `e1f9ae3` 46.349930: logarithmic_accumulation: raw: 1e848980 tv_nsec 8831c0e3 The kernel starts looping at 46.349925 in the getrawmonotonic() due to the negative value from adding the raw value to tv_nsec. A simple solution is to accumulate into a u64, and then normalize it to a timespec_t. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> [ Reworked variable names and simplified some of the code. - John ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-12 09:53:39 -07:00
David Howells	12fdff3fc2	Add a dummy printk function for the maintenance of unused printks Add a dummy printk function for the maintenance of unused printks through gcc format checking, and also so that side-effect checking is maintained too. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-12 09:51:35 -07:00
Adrian Hunter	8d57a98ccd	block: add secure discard Secure discard is the same as discard except that all copies of the discarded sectors (perhaps created by garbage collection) must also be erased. Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Madhusudhan Chikkature <madhu.cr@ti.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Gardiner <bengardiner@nanometrics.ca> Cc: <linux-mmc@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-12 08:43:30 -07:00
Stefani Seibold	d78a3eda69	kernel/kfifo.c: add handling of chained scatterlists The current kfifo scatterlist implementation will not work with chained scatterlists. It assumes that struct scatterlist arrays are allocated contiguously, which is not the case when chained scatterlists (struct sg_table) are in use. Signed-off-by: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-12 08:43:29 -07:00
Linus Torvalds	5af568cbd5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: isofs: Fix lseek() to position beyond 4 GB vfs: remove unused MNT_STRICTATIME vfs: show unreachable paths in getcwd and proc vfs: only add " (deleted)" where necessary vfs: add prepend_path() helper vfs: __d_path: dont prepend the name of the root dentry ia64: perfmon: add d_dname method vfs: add helpers to get root and pwd cachefiles: use path_get instead of lone dget fs/sysv/super.c: add support for non-PDP11 v7 filesystems V7: Adjust sanity checks for some volumes Add v7 alias v9fs: fixup for inode_setattr being removed Manual merge to take Al's version of the fs/sysv/super.c file: it merged cleanly, but Al had removed an unnecessary header include, so his side was better.	2010-08-11 09:23:32 -07:00
Stefani Seibold	2e956fb320	kfifo: replace the old non generic API Simply replace the whole kfifo.c and kfifo.h files with the new generic version and fix the kerneldoc API template file. Signed-off-by: Stefani Seibold <stefani@seibold.net> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:23 -07:00
Stefani Seibold	4201d9a8e8	kfifo: add the new generic kfifo API Add the new version of the kfifo API files kfifo.c and kfifo.h. Signed-off-by: Stefani Seibold <stefani@seibold.net> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:23 -07:00
Dan Carpenter	f65a03f6ab	kexec: return -EFAULT on copy_to_user() failures copy_to/from_user() returns the number of bytes remaining to be copied. It never returns a negative value. The correct return code is -EFAULT and not -EIO. All the callers check for non-zero returns so that's Ok, but the return code is passed to the user so we should fix this. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Simon Kagstrom <simon.kagstrom@netinsight.net> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:22 -07:00
Anton Blanchard	863a604920	lib/bug.c: add oops end marker to WARN implementation We are missing the oops end marker for the exception based WARN implementation in lib/bug.c. This is useful for logfile analysis tools. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@infradead.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:22 -07:00
TAMUKI Shoichi	c7ff0d9c92	panic: keep blinking in spite of long spin timer mode To keep panic_timeout accuracy when running under a hypervisor, the current implementation only spins on long time (1 second) calls to mdelay. That brings a good effect, but the problem is the keyboard LEDs don't blink at all on that situation. This patch changes to call to panic_blink_enter() between every mdelay and keeps blinking in spite of long spin timer mode. The time to call to mdelay is now 100ms. Even this change will keep panic_timeout accuracy enough when running under a hypervisor. Signed-off-by: TAMUKI Shoichi <tamuki@linet.gr.jp> Cc: Ben Dooks <ben-linux@fluff.org> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Dmitry Torokhov <dtor@mail.ru> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:22 -07:00
Oleg Nesterov	c52b0b91ba	pids: alloc_pidmap: remove the unnecessary boundary checks alloc_pidmap() calculates max_scan so that if the initial offset != 0 we inspect the first map->page twice. This is correct, we want to find the unused bits < offset in this bitmap block. Add the comment. But it doesn't make any sense to stop the find_next_offset() loop when we are looking into this map->page for the second time. We have already already checked the bits >= offset during the first attempt, it is fine to do this again, no matter if we succeed this time or not. Remove this hard-to-understand code. It optimizes the very unlikely case when we are going to fail, but slows down the more likely case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Salman Qazi <sqazi@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:20 -07:00
Salman	5fdee8c4a5	pids: fix a race in pid generation that causes pids to be reused immediately A program that repeatedly forks and waits is susceptible to having the same pid repeated, especially when it competes with another instance of the same program. This is really bad for bash implementation. Furthermore, many shell scripts assume that pid numbers will not be used for some length of time. Race Description: A B // pid == offset == n // pid == offset == n + 1 test_and_set_bit(offset, map->page) test_and_set_bit(offset, map->page); pid_ns->last_pid = pid; pid_ns->last_pid = pid; // pid == n + 1 is freed (wait()) // Next fork()... last = pid_ns->last_pid; // == n pid = last + 1; Code to reproduce it (Running multiple instances is more effective): #include <errno.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> // The distance mod 32768 between two pids, where the first pid is expected // to be smaller than the second. int PidDistance(pid_t first, pid_t second) { return (second + 32768 - first) % 32768; } int main(int argc, char* argv[]) { int failed = 0; pid_t last_pid = 0; int i; printf("%d\n", sizeof(pid_t)); for (i = 0; i < 10000000; ++i) { if (i % 32786 == 0) printf("Iter: %d\n", i/32768); int child_exit_code = i % 256; pid_t pid = fork(); if (pid == -1) { fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno); exit(1); } if (pid == 0) { // Child exit(child_exit_code); } else { // Parent if (i > 0) { int distance = PidDistance(last_pid, pid); if (distance == 0 \|\| distance > 30000) { fprintf(stderr, "Unexpected pid sequence: previous fork: pid=%d, " "current fork: pid=%d for iteration=%d.\n", last_pid, pid, i); failed = 1; } } last_pid = pid; int status; int reaped = wait(&status); if (reaped != pid) { fprintf(stderr, "Wait return value: expected pid=%d, " "got %d, iteration %d\n", pid, reaped, i); failed = 1; } else if (WEXITSTATUS(status) != child_exit_code) { fprintf(stderr, "Unexpected exit status %x, iteration %d\n", WEXITSTATUS(status), i); failed = 1; } } } exit(failed); } Thanks to Ted Tso for the key ideas of this implementation. Signed-off-by: Salman Qazi <sqazi@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:20 -07:00
Oleg Nesterov	c7e49c1488	ptrace: optimize exit_ptrace() for the likely case exit_ptrace() takes tasklist_lock unconditionally. We need this lock to avoid the race with ptrace_traceme(), it acts as a barrier. Change its caller, forget_original_parent(), to call exit_ptrace() under tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if needed. This allows us to add the fastpath list_empty(ptraced) check. In the likely no-tracees case exit_ptrace() just returns and we avoid the lock() + unlock() sequence. "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> suggested to add this check, and he reports that this change adds about 11% improvement in some tests. Suggested-and-tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:19 -07:00
Dan Carpenter	e400c28524	cgroups: save space for the terminator The original code didn't leave enough space for a NULL terminator. These strings are copied with strcpy() into fixed length buffers in cgroup_root_from_opts(). Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ben Blum <bblum@andrew.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-11 08:59:18 -07:00
Rusty Russell	907b29eb41	param: locking for kernel parameters There may be cases (most obviously, sysfs-writable charp parameters) where a module needs to prevent sysfs access to parameters. Rather than express this in terms of a big lock, the functions are expressed in terms of what they protect against. This is clearer, esp. if the implementation changes to a module-level or even param-level lock. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Takashi Iwai <tiwai@suse.de> Tested-by: Phil Carmody <ext-phil.2.carmody@nokia.com>	2010-08-11 23:04:20 +09:30
Rusty Russell	914dcaa84c	param: make param sections const. Since this section can be read-only (they're in .rodata), they should always have been const. Minor flow-through various functions. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: Phil Carmody <ext-phil.2.carmody@nokia.com>	2010-08-11 23:04:19 +09:30
Rusty Russell	a1054322af	param: use free hook for charp (fix leak of charp parameters) Instead of using a "I kmalloced this" flag, we keep track of the kmalloced strings and use that list to check if we need to kfree (in practice, the list is very short). This means that kparams can be const again, and plugs a leak. This is important for drivers/usb/gadget/nokia.c which gets modprobe/rmmod'ed frequently on the N9000. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Takashi Iwai <tiwai@suse.de> Cc: Artem Bityutskiy <dedekind1@gmail.com> Tested-by: Phil Carmody <ext-phil.2.carmody@nokia.com>	2010-08-11 23:04:18 +09:30
Rusty Russell	e6df34a442	param: add a free hook to kernel_param_ops. This allows us to generalize the KPARAM_KMALLOCED flag, by calling a function on every parameter when a module is unloaded. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Takashi Iwai <tiwai@suse.de> Tested-by: Phil Carmody <ext-phil.2.carmody@nokia.com>	2010-08-11 23:04:18 +09:30
Rusty Russell	9bbb9e5a33	param: use ops in struct kernel_param, rather than get and set fns directly This is more kernel-ish, saves some space, and also allows us to expand the ops without breaking all the callers who are happy for the new members to be NULL. The few places which defined their own param types are changed to the new scheme (more which crept in recently fixed in following patches). Since we're touching them anyway, we change get() and set() to take a const struct kernel_param (which they really are). This causes some harmless warnings until we fix them (in following patches). To reduce churn, module_param_call creates the ops struct so the callers don't have to change (and casts the functions to reduce warnings). The modern version which takes an ops struct is called module_param_cb. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Takashi Iwai <tiwai@suse.de> Tested-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ville Syrjala <syrjala@sci.fi> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Alessandro Rubini <rubini@ipvvis.unipv.it> Cc: Michal Januszewski <spock@gentoo.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: linux-kernel@vger.kernel.org Cc: linux-input@vger.kernel.org Cc: linux-fbdev-devel@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: netdev@vger.kernel.org	2010-08-11 23:04:13 +09:30
Rusty Russell	a14fe249a8	param: move the EXPORT_SYMBOL to after the definitions. This is modern style, and good to do before we start changing things. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Takashi Iwai <tiwai@suse.de> Tested-by: Phil Carmody <ext-phil.2.carmody@nokia.com>	2010-08-11 23:04:12 +09:30
Rusty Russell	2e9fb9953d	params: don't hand NULL values to param.set callbacks. An audit by Dongdong Deng revealed that most driver-author-written param calls don't handle val == NULL (which happens when parameters are specified with no =, eg "foo" instead of "foo=1"). The only real case to use this is boolean, so handle it specially for that case and remove a source of bugs for everyone else. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Dongdong Deng <dongdong.deng@windriver.com> Cc: Américo Wang <xiyou.wangcong@gmail.com>	2010-08-11 23:04:11 +09:30
Jiri Kosina	6396fc3b3f	Merge branch 'master' into for-next Conflicts: fs/exofs/inode.c	2010-08-11 09:36:51 +02:00
Miklos Szeredi	f7ad3c6be9	vfs: add helpers to get root and pwd Add three helpers that retrieve a refcounted copy of the root and cwd from the supplied fs_struct. get_fs_root() get_fs_pwd() get_fs_root_and_pwd() Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-11 00:28:20 -04:00
Randy Dunlap	0caa621065	kernel/timer.c: fix kernel-doc function parameter warning Fix kernel-doc warning, add @timer description: Warning(kernel/timer.c:335): No description found for parameter 'timer' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-10 15:33:09 -07:00
Linus Torvalds	2f9e825d3e	Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits) block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n xen-blkfront: fix missing out label blkdev: fix blkdev_issue_zeroout return value block: update request stacking methods to support discards block: fix missing export of blk_types.h writeback: fix bad _bh spinlock nesting drbd: revert "delay probes", feature is being re-implemented differently drbd: Initialize all members of sync_conf to their defaults [Bugz 315] drbd: Disable delay probes for the upcomming release writeback: cleanup bdi_register writeback: add new tracepoints writeback: remove unnecessary init_timer call writeback: optimize periodic bdi thread wakeups writeback: prevent unnecessary bdi threads wakeups writeback: move bdi threads exiting logic to the forker thread writeback: restructure bdi forker loop a little writeback: move last_active to bdi writeback: do not remove bdi from bdi_list writeback: simplify bdi code a little writeback: do not lose wake-ups in bdi threads ... Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and drivers/scsi/scsi_error.c as per Jens.	2010-08-10 15:22:42 -07:00
Linus Torvalds	b34d8915c4	Merge branch 'writable_limits' of git://decibel.fi.muni.cz/~xslaby/linux * 'writable_limits' of git://decibel.fi.muni.cz/~xslaby/linux: unistd: add __NR_prlimit64 syscall numbers rlimits: implement prlimit64 syscall rlimits: switch more rlimit syscalls to do_prlimit rlimits: redo do_setrlimit to more generic do_prlimit rlimits: add rlimit64 structure rlimits: do security check under task_lock rlimits: allow setrlimit to non-current tasks rlimits: split sys_setrlimit rlimits: selinux, do rlimits changes under task_lock rlimits: make sure ->rlim_max never grows in sys_setrlimit rlimits: add task_struct to update_rlimit_cpu rlimits: security, add task_struct to setrlimit Fix up various system call number conflicts. We not only added fanotify system calls in the meantime, but asm-generic/unistd.h added a wait4 along with a range of reserved per-architecture system calls.	2010-08-10 12:07:51 -07:00
Linus Torvalds	8c8946f509	Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify * 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits) fanotify: use both marks when possible fsnotify: pass both the vfsmount mark and inode mark fsnotify: walk the inode and vfsmount lists simultaneously fsnotify: rework ignored mark flushing fsnotify: remove global fsnotify groups lists fsnotify: remove group->mask fsnotify: remove the global masks fsnotify: cleanup should_send_event fanotify: use the mark in handler functions audit: use the mark in handler functions dnotify: use the mark in handler functions inotify: use the mark in handler functions fsnotify: send fsnotify_mark to groups in event handling functions fsnotify: Exchange list heads instead of moving elements fsnotify: srcu to protect read side of inode and vfsmount locks fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called fsnotify: use _rcu functions for mark list traversal fsnotify: place marks on object in order of group memory address vfs/fsnotify: fsnotify_close can delay the final work in fput fsnotify: store struct file not struct path ... Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.	2010-08-10 11:39:13 -07:00
Linus Torvalds	5f248c9c25	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits) no need for list_for_each_entry_safe()/resetting with superblock list Fix sget() race with failing mount vfs: don't hold s_umount over close_bdev_exclusive() call sysv: do not mark superblock dirty on remount sysv: do not mark superblock dirty on mount btrfs: remove junk sb_dirt change BFS: clean up the superblock usage AFFS: wait for sb synchronization when needed AFFS: clean up dirty flag usage cifs: truncate fallout mbcache: fix shrinker function return value mbcache: Remove unused features add f_flags to struct statfs(64) pass a struct path to vfs_statfs update VFS documentation for method changes. All filesystems that need invalidate_inode_buffers() are doing that explicitly convert remaining ->clear_inode() to ->evict_inode() Make ->drop_inode() just return whether inode needs to be dropped fs/inode.c:clear_inode() is gone fs/inode.c:evict() doesn't care about delete vs. non-delete paths now ... Fix up trivial conflicts in fs/nilfs2/super.c	2010-08-10 11:26:52 -07:00
Jiri Kosina	fb8231a8b1	Merge branch 'master' into for-next Conflicts: arch/arm/mach-omap1/board-nokia770.c	2010-08-10 13:22:08 +02:00
Andi Kleen	8c4af38e9b	gcc-4.6: printk: use stable variable to dump kmsg buffer kmsg_dump takes care to sample the global variables inside a spinlock, but then goes on to use the same variables outside the spinlock region too. Use the correct variable. This will make the race window smaller. Found by gcc 4.6's new warnings. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-09 20:45:06 -07:00
Richard Kennedy	878ae12749	stop_machine: struct cpu_stopper, remove alignment padding on 64 bits Reorder elements in structure cpu_stopper to remove alignment padding on 64 bit builds, this shrinks its size from 40 to 32 bytes saving 8 bytes per cpu. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Acked-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-09 20:45:06 -07:00
Geert Uytterhoeven	459b37d423	kernel/range: remove unused definition of ARRAY_SIZE() Remove duplicate definition of ARRAY_SIZE(), which was never used anyway. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-09 20:45:06 -07:00
Oleg Nesterov	2ee7c922f2	sys_personality: remove the bogus checks in sys_personality()->__set_personality() path Cleanup, no functional changes. - __set_personality() always changes ->exec_domain/personality, the special case when ->exec_domain remains the same buys nothing but complicates the code. Unify both cases to simplify the code. - The -EINVAL check in sys_personality() was never right. If we assume that set_personality() can fail we should check the value it returns instead of verifying that task->personality was actually changed. Remove it. Before the previous patch it was possible to hit this case due to overflow problems, but this -EINVAL just indicated the kernel bug. OTOH, probably it makes sense to change lookup_exec_domain() to return ERR_PTR() instead of default_exec_domain if the search in exec_domains list fails, and report this error to the user-space. But this means another user-space change, and we have in-kernel users which need fixes. For example, PER_OSF4 falls into PER_MASK for unkown reason and nobody cares to register this domain. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Wenming Zhang <wezhang@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-09 20:45:05 -07:00
KAMEZAWA Hiroyuki	d2997b1042	hibernation: freeze swap at hibernation When taking a memory snapshot in hibernate_snapshot(), all (directly called) memory allocations use GFP_ATOMIC. Hence swap misusage during hibernation never occurs. But from a pessimistic point of view, there is no guarantee that no page allcation has __GFP_WAIT. It is better to have a global indication "we enter hibernation, don't use swap!". This patch tries to freeze new-swap-allocation during hibernation. (All user processes are frozenm so swapin is not a concern). This way, no updates will happen to swap_map[] between hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free() is called. We can be assured that swap corruption will not occur. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ondrej Zary <linux@rainbow-software.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-09 20:45:04 -07:00
David Rientjes	a63d83f427	oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-09 20:45:02 -07:00
David Rientjes	8e4228e1ed	oom: move sysctl declarations to oom.h The three oom killer sysctl variables (sysctl_oom_dump_tasks, sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better declared in include/linux/oom.h rather than kernel/sysctl.c. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-09 20:44:57 -07:00
Christoph Hellwig	ebabe9a900	pass a struct path to vfs_statfs We'll need the path to implement the flags field for statvfs support. We do have it available in all callers except: - ecryptfs_statfs. This one doesn't actually need vfs_statfs but just needs to do a caller to the lower filesystem statfs method. - sys_ustat. Add a non-exported statfs_by_dentry helper for it which doesn't won't be able to fill out the flags field later on. In addition rename the helpers for statfs vs fstatfs to do_*statfs instead of the misleading vfs prefix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-08-09 16:48:42 -04:00
Tejun Heo	f6500947a9	workqueue: workqueue_cpu_callback() should be cpu_notifier instead of hotcpu_notifier Commit `6ee0578b` (workqueue: mark init_workqueues as early_initcall) made workqueue SMP initialization depend on workqueue_cpu_callback(), which however was registered as hotcpu_notifier() and didn't get called if CONFIG_HOTPLUG_CPU is not set. This made gcwqs on non-boot CPUs not create their initial workers leading to boot failures. Fix it by making it a cpu_notifier. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-bisected-by: walt <w41ter@gmail.com> Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>	2010-08-09 11:50:34 +02:00
Paul Bolle	426d31071a	fix printk typo 'faild' Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-08-09 11:25:17 +02:00
Namhyung Kim	38f5156800	workqueue: add missing __percpu markup in kernel/workqueue.c works in schecule_on_each_cpu() is a percpu pointer but was missing __percpu markup. Add it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-08-08 14:24:09 +02:00
Linus Torvalds	78417334b5	Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: do_coredump: Do not take BKL init: Remove the BKL from startup code	2010-08-07 17:06:54 -07:00
Linus Torvalds	3b7433b8a8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits) workqueue: mark init_workqueues() as early_initcall() workqueue: explain for_each_cwq_cpu() iterators fscache: fix build on !CONFIG_SYSCTL slow-work: kill it gfs2: use workqueue instead of slow-work drm: use workqueue instead of slow-work cifs: use workqueue instead of slow-work fscache: drop references to slow-work fscache: convert operation to use workqueue instead of slow-work fscache: convert object to use workqueue instead of slow-work workqueue: fix how cpu number is stored in work->data workqueue: fix mayday_mask handling on UP workqueue: fix build problem on !CONFIG_SMP workqueue: fix locking in retry path of maybe_create_worker() async: use workqueue for worker pool workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead workqueue: implement unbound workqueue workqueue: prepare for WQ_UNBOUND implementation libata: take advantage of cmwq and remove concurrency limitations workqueue: fix worker management invocation without pending works ... Fixed up conflicts in fs/cifs/ as per Tejun. Other trivial conflicts in include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c	2010-08-07 12:42:58 -07:00
Arnd Bergmann	62c2a7d969	block: push BKL into blktrace ioctls The blktrace driver currently needs the BKL, but we should not need to take that in the block layer, so just push it down into the driver itself. It is quite likely that the BKL is not actually required in blktrace code and could be removed in a follow-on patch. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:26:08 +02:00
Christoph Hellwig	7b6d91daee	block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:20:39 +02:00
Christoph Hellwig	33659ebbae	block: remove wrappers for request type/flags Remove all the trivial wrappers for the cmd_type and cmd_flags fields in struct requests. This allows much easier grepping for different request types instead of unwinding through macros. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:17:56 +02:00
Linus Torvalds	1787985782	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: xen: Do not suspend IPI IRQs. powerpc: Use IRQF_NO_SUSPEND not IRQF_TIMER for non-timer interrupts ixp4xx-beeper: Use IRQF_NO_SUSPEND not IRQF_TIMER for non-timer interrupt irq: Add new IRQ flag IRQF_NO_SUSPEND	2010-08-06 13:25:43 -07:00
Linus Torvalds	b62ad9ab18	Merge branch 'timers-timekeeping-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-timekeeping-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: um: Fix read_persistent_clock fallout kgdb: Do not access xtime directly powerpc: Clean up obsolete code relating to decrementer and timebase powerpc: Rework VDSO gettimeofday to prevent time going backwards clocksource: Add __clocksource_updatefreq_hz/khz methods x86: Convert common clocksources to use clocksource_register_hz/khz timekeeping: Make xtime and wall_to_monotonic static hrtimer: Cleanup direct access to wall_to_monotonic um: Convert to use read_persistent_clock timkeeping: Fix update_vsyscall to provide wall_to_monotonic offset powerpc: Cleanup xtime usage powerpc: Simplify update_vsyscall time: Kill off CONFIG_GENERIC_TIME time: Implement timespec_add x86: Fix vtime/file timestamp inconsistencies Trivial conflicts in Documentation/feature-removal-schedule.txt Much less trivial conflicts in arch/powerpc/kernel/time.c resolved as per Thomas' earlier merge commit `47916be4e2` ("Merge branch 'powerpc.cherry-picks' into timers/clocksource")	2010-08-06 13:18:29 -07:00
Linus Torvalds	af39008435	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: Documentation: Add timers/timers-howto.txt timer: Added usleep_range timer Revert "timer: Added usleep[_range] timer" clockevents: Remove the per cpu tick skew posix_timer: Move copy_to_user(created_timer_id) down in timer_create() timer: Added usleep[_range] timer timers: Document meaning of deferrable timer	2010-08-06 13:12:36 -07:00
Linus Torvalds	ab69bcd66f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (28 commits) driver core: device_rename's new_name can be const sysfs: Remove owner field from sysfs struct attribute powerpc/pci: Remove owner field from attribute initialization in PCI bridge init regulator: Remove owner field from attribute initialization in regulator core driver leds: Remove owner field from attribute initialization in bd2802 driver scsi: Remove owner field from attribute initialization in ARCMSR driver scsi: Remove owner field from attribute initialization in LPFC driver cgroupfs: create /sys/fs/cgroup to mount cgroupfs on Driver core: Add BUS_NOTIFY_BIND_DRIVER driver core: fix memory leak on one error path in bus_register() debugfs: no longer needs to depend on SYSFS sysfs: Fix one more signature discrepancy between sysfs implementation and docs. sysfs: fix discrepancies between implementation and documentation dcdbas: remove a redundant smi_data_buf_free in dcdbas_exit dmi-id: fix a memory leak in dmi_id_init error path sysfs: sysfs_chmod_file's attr can be const firmware: Update hotplug script Driver core: move platform device creation helpers to .init.text (if MODULE=n) Driver core: reduce duplicated code for platform_device creation Driver core: use kmemdup in platform_device_add_resources ...	2010-08-06 11:36:30 -07:00
Huang Ying	18fab912d4	tracing: Fix ring_buffer_read_page reading out of page boundary With the configuration: CONFIG_DEBUG_PAGEALLOC=y and Shaohua's patch: [PATCH]x86: make spurious_fault check correct pte bit Function call graph trace with the following will trigger a page fault. # cd /sys/kernel/debug/tracing/ # echo function_graph > current_tracer # cat per_cpu/cpu1/trace_pipe_raw > /dev/null BUG: unable to handle kernel paging request at ffff880006e99000 IP: [<ffffffff81085572>] rb_event_length+0x1/0x3f PGD 1b19063 PUD 1b1d063 PMD 3f067 PTE 6e99160 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/virtual/net/lo/operstate CPU 1 Modules linked in: Pid: 1982, comm: cat Not tainted 2.6.35-rc6-aes+ #300 /Bochs RIP: 0010:[<ffffffff81085572>] [<ffffffff81085572>] rb_event_length+0x1/0x3f RSP: 0018:ffff880006475e38 EFLAGS: 00010006 RAX: 0000000000000ff0 RBX: ffff88000786c630 RCX: 000000000000001d RDX: ffff880006e98000 RSI: 0000000000000ff0 RDI: ffff880006e99000 RBP: ffff880006475eb8 R08: 000000145d7008bd R09: 0000000000000000 R10: 0000000000008000 R11: ffffffff815d9336 R12: ffff880006d08000 R13: ffff880006e605d8 R14: 0000000000000000 R15: 0000000000000018 FS: 00007f2b83e456f0(0000) GS:ffff880002100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff880006e99000 CR3: 00000000064a8000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process cat (pid: 1982, threadinfo ffff880006474000, task ffff880006e40770) Stack: ffff880006475eb8 ffffffff8108730f 0000000000000ff0 000000145d7008bd <0> ffff880006e98010 ffff880006d08010 0000000000000296 ffff88000786c640 <0> ffffffff81002956 0000000000000000 ffff8800071f4680 ffff8800071f4680 Call Trace: [<ffffffff8108730f>] ? ring_buffer_read_page+0x15a/0x24a [<ffffffff81002956>] ? return_to_handler+0x15/0x2f [<ffffffff8108a575>] tracing_buffers_read+0xb9/0x164 [<ffffffff810debfe>] vfs_read+0xaf/0x150 [<ffffffff81002941>] return_to_handler+0x0/0x2f [<ffffffff810248b0>] __bad_area_nosemaphore+0x17e/0x1a1 [<ffffffff81002941>] return_to_handler+0x0/0x2f [<ffffffff810248e6>] bad_area_nosemaphore+0x13/0x15 Code: 80 25 b2 16 b3 00 fe c9 c3 55 48 89 e5 f0 80 0d a4 16 b3 00 02 c9 c3 55 31 c0 48 89 e5 48 83 3d 94 16 b3 00 01 c9 0f 94 c0 c3 55 <8a> 0f 48 89 e5 83 e1 1f b8 08 00 00 00 0f b6 d1 83 fa 1e 74 27 RIP [<ffffffff81085572>] rb_event_length+0x1/0x3f RSP <ffff880006475e38> CR2: ffff880006e99000 ---[ end trace a6877bb92ccb36bb ]--- The root cause is that ring_buffer_read_page() may read out of page boundary, because the boundary checking is done after reading. This is fixed via doing boundary checking before reading. Reported-by: Shaohua Li <shaohua.li@intel.com> Cc: <stable@kernel.org> Signed-off-by: Huang Ying <ying.huang@intel.com> LKML-Reference: <1280297641.2771.307.camel@yhuang-dev> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-06 14:34:45 -04:00
Linus Torvalds	c4efd6b569	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits) sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug sched: No need for bootmem special cases sched: Revert nohz_ratelimit() for now sched: Reduce update_group_power() calls sched: Update rq->clock for nohz balanced cpus sched: Fix spelling of sibling sched, cpuset: Drop __cpuexit from cpu hotplug callbacks sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check() sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand() sched: thread_group_cputime: Simplify, document the "alive" check sched: Remove the obsolete exit_state/signal hacks sched: task_tick_rt: Remove the obsolete ->signal != NULL check sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless sched: Fix comments to make them DocBook happy sched: Fix fix_small_capacity powerpc: Exclude arch_sd_sibiling_asym_packing() on UP powerpc: Enable asymmetric SMT scheduling on POWER7 sched: Add asymmetric group packing option for sibling domain sched: Fix capacity calculations for SMT4 sched: Change nohz idle load balancing logic to push model ...	2010-08-06 09:39:22 -07:00
Linus Torvalds	4aed2fd8e3	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (162 commits) tracing/kprobes: unregister_trace_probe needs to be called under mutex perf: expose event__process function perf events: Fix mmap offset determination perf, powerpc: fsl_emb: Restore setting perf_sample_data.period perf, powerpc: Convert the FSL driver to use local64_t perf tools: Don't keep unreferenced maps when unmaps are detected perf session: Invalidate last_match when removing threads from rb_tree perf session: Free the ref_reloc_sym memory at the right place x86,mmiotrace: Add support for tracing STOS instruction perf, sched migration: Librarize task states and event headers helpers perf, sched migration: Librarize the GUI class perf, sched migration: Make the GUI class client agnostic perf, sched migration: Make it vertically scrollable perf, sched migration: Parameterize cpu height and spacing perf, sched migration: Fix key bindings perf, sched migration: Ignore unhandled task states perf, sched migration: Handle ignored migrate out events perf: New migration tool overview tracing: Drop cpparg() macro perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call ... Fix up trivial conflicts in Makefile and drivers/cpufreq/cpufreq.c	2010-08-06 09:30:52 -07:00
Linus Torvalds	3a3527b646	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: Revert "net: Make accesses to ->br_port safe for sparse RCU" mce: convert to rcu_dereference_index_check() net: Make accesses to ->br_port safe for sparse RCU vfs: add fs.h to define struct file lockdep: Add an in_workqueue_context() lockdep-based test function rcu: add __rcu API for later sparse checking rcu: add an rcu_dereference_index_check() tree/tiny rcu: Add debug RCU head objects mm: remove all rcu head initializations fs: remove all rcu head initializations, except on_stack initializations powerpc: remove all rcu head initializations	2010-08-06 09:23:07 -07:00
Shaohua Li	575570f027	tracing: Fix an unallocated memory access in function_graph With CONFIG_DEBUG_PAGEALLOC, I observed an unallocated memory access in function_graph trace. It appears we find a small size entry in ring buffer, but we access it as a big size entry. The access overflows the page size and touches an unallocated page. Cc: <stable@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> LKML-Reference: <1280217994.32400.76.camel@sli10-desk.sh.intel.com> [ Added a comment to explain the problem - SDR ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-06 12:19:15 -04:00
Linus Torvalds	9779714c8a	Merge branch 'kms-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'kms-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb,docs: Update the kgdb docs to include kms drm_fb_helper: Preserve capability to use atomic kms i915: when kgdb is active display compression should be off drm/i915: use new fb debug hooks drm: add KGDB/KDB support fb: add hooks to handle KDB enter/exit kgdboc: Add call backs to allow kernel mode switching vt,console,kdb: automatically set kdb LINES variable vt,console,kdb: implement atomic console enter/leave functions	2010-08-05 16:00:44 -07:00
Linus Torvalds	89a6c8cb9e	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: debug_core,kdb: fix crash when arch does not have single step kgdb,x86: use macro HBP_NUM to replace magic number 4 kgdb,mips: remove unused kgdb_cpu_doing_single_step operations mm,kdb,kgdb: Add a debug reference for the kdb kmap usage KGDB: Remove set but unused newPC ftrace,kdb: Allow dumping a specific cpu's buffer with ftdump ftrace,kdb: Extend kdb to be able to dump the ftrace buffer kgdb,powerpc: Replace hardcoded offset by BREAK_INSTR_SIZE arm,kgdb: Add ability to trap into debugger on notify_die gdbstub: do not directly use dbg_reg_def[] in gdb_cmd_reg_set() gdbstub: Implement gdbserial 'p' and 'P' packets kgdb,arm: Individual register get/set for arm kgdb,mips: Individual register get/set for mips kgdb,x86: Individual register get/set for x86 kgdb,kdb: individual register set and and get API gdbstub: Optimize kgdb's "thread:" response for the gdb serial protocol kgdb: remove custom hex_to_bin()implementation	2010-08-05 15:59:48 -07:00
Greg KH	676db4af04	cgroupfs: create /sys/fs/cgroup to mount cgroupfs on We really shouldn't be asking userspace to create new root filesystems. So follow along with all of the other in-kernel filesystems, and provide a mount point in sysfs. For cgroupfs, this should be in /sys/fs/cgroup/ This change provides that mount point when the cgroup filesystem is registered in the kernel. Acked-by: Paul Menage <menage@google.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-08-05 13:53:35 -07:00
Ian Abbott	94f17cd788	hotplug: Support kernel/hotplug sysctl variable when !CONFIG_NET The kernel/hotplug sysctl variable (/proc/sys/kernel/hotplug file) was made conditional on CONFIG_NET by commit `f743ca5e10` (applied in 2.6.18) to fix problems with undefined references in 2.6.16 when CONFIG_HOTPLUG=y && !CONFIG_NET, but this restriction is no longer needed. This patch makes the kernel/hotplug sysctl variable depend only on CONFIG_HOTPLUG. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Acked-by: Randy Dunlap <randy.dunlap@oracle.COM> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-08-05 13:53:33 -07:00
Linus Torvalds	90d3417a3a	Merge branch 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: cleanup comments, remove noinline module: group post-relocation functions into post_relocation() module: move module args strndup_user to just before use module: pass load_info into other functions module: fix sysfs cleanup for !CONFIG_SYSFS module: sysfs cleanup module: layout_and_allocate module: fix crash in get_ksymbol() when oopsing in module init module: kallsyms functions take struct load_info module: refactor out section header rewriting: FIX modversions module: refactor out section header rewriting module: add load_info module: reduce stack usage for each_symbol() module: refactor load_module part 5 module: refactor load_module part 4 module: refactor load_module part 3 module: refactor load_module part 2 module: refactor load_module module: module_unload_init() cleanup	2010-08-05 13:49:55 -07:00
Linus Torvalds	cdd854bc42	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (79 commits) powerpc/8xx: Add support for the MPC8xx based boards from TQC powerpc/85xx: Introduce support for the Freescale P1022DS reference board powerpc/85xx: Adding DTS for the STx GP3-SSA MPC8555 board powerpc/85xx: Change deprecated binding for 85xx-based boards powerpc/tqm85xx: add a quirk for ti1520 PCMCIA bridge powerpc/tqm85xx: update PCI interrupt-map attribute powerpc/mpc8308rdb: support for MPC8308RDB board from Freescale powerpc/fsl_pci: add quirk for mpc8308 pcie bridge powerpc/85xx: Cleanup QE initialization for MPC85xxMDS boards powerpc/85xx: Fix booting for P1021MDS boards powerpc/85xx: Fix SWIOTLB initalization for MPC85xxMDS boards powerpc/85xx: kexec for SMP 85xx BookE systems powerpc/5200/i2c: improve i2c bus error recovery of/xilinxfb: update tft compatible versions powerpc/fsl-diu-fb: Support setting display mode using EDID powerpc/5121: doc/dts-bindings: update doc of FSL DIU bindings powerpc/5121: shared DIU framebuffer support powerpc/5121: move fsl-diu-fb.h to include/linux powerpc/5121: fsl-diu-fb: fix issue with re-enabling DIU area descriptor powerpc/512x: add clock structure for Video-IN (VIU) unit ...	2010-08-05 09:03:46 -07:00
Linus Torvalds	c3d1f1746b	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus * 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (150 commits) MIPS: PowerTV: Separate PowerTV USB support from non-USB code MIPS: strip the un-needed sections of vmlinuz MIPS: Clean up the calculation of VMLINUZ_LOAD_ADDRESS MIPS: Clean up arch/mips/boot/compressed/decompress.c MIPS: Clean up arch/mips/boot/compressed/ld.script MIPS: Unify the suffix of compressed vmlinux.bin MIPS: PowerTV: Add Gaia platform definitions. MIPS: BCM47xx: Fix nvram_getenv return value. MIPS: Octeon: Allow more than 3.75GB of memory with PCIe MIPS: Clean up notify_die() usage. MIPS: Remove unused task_struct.trap_no field. Documentation: Mention that KProbes is supported on MIPS SAMPLES: kprobe_example: Make it print something on MIPS. MIPS: kprobe: Add support. MIPS: Add instrunction format for BREAK and SYSCALL MIPS: kprobes: Define regs_return_value() MIPS: Ritually kill stupid printk. MIPS: Octeon: Disallow MSI-X interrupt and fall back to MSI interrupts. MIPS: Octeon: Support 256 MSI on PCIe MIPS: Decode core number for R2 CPUs. ...	2010-08-05 08:53:20 -07:00
Jason Wessel	81d4450732	vt,console,kdb: automatically set kdb LINES variable The kernel console interface stores the number of lines it is configured to use. The kdb debugger can greatly benefit by knowing how many lines there are on the console for the pager functionality without having the end user compile in the setting or have to repeatedly change it at run time. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> CC: David Airlie <airlied@linux.ie> CC: Andrew Morton <akpm@linux-foundation.org>	2010-08-05 09:22:30 -05:00
Jason Wessel	3fa43aba08	debug_core,kdb: fix crash when arch does not have single step When an arch such as mips and microblaze does not implement either HW or software single stepping the debug core should re-enter kdb. The kdb code will properly ignore the single step operation. Attempting to single step the kernel without software or hardware support causes unpredictable kernel crashes. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:25 -05:00
Jason Wessel	19063c776f	ftrace,kdb: Allow dumping a specific cpu's buffer with ftdump In systems with more than one processor it is desirable to look at the per cpu trace buffers. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com>	2010-08-05 09:22:23 -05:00
Jason Wessel	955b61e597	ftrace,kdb: Extend kdb to be able to dump the ftrace buffer Add in a helper function to allow the kdb shell to dump the ftrace buffer. Modify trace.c to expose the capability to iterate over the ftrace buffer in a read only capacity. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com>	2010-08-05 09:22:23 -05:00
Jason Wessel	6d855b1d83	gdbstub: do not directly use dbg_reg_def[] in gdb_cmd_reg_set() Presently the usable registers definitions on x86 are not contiguous for kgdb. The x86 kgdb uses a case statement for the sparse register accesses. The array which defines the registers (dbg_reg_def) should not be used directly in order to safely work with sparse register definitions. Specifically there was a problem when gdb accesses ORIG_AX, which is accessed only through the case statement. This patch encodes register memory using the size information provided from the debugger which avoids the need to look up the size of the register. The dbg_set_reg() function always further validates the inputs from the debugger. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>	2010-08-05 09:22:22 -05:00
Jason Wessel	55751145dc	gdbstub: Implement gdbserial 'p' and 'P' packets The gdbserial 'p' and 'P' packets allow gdb to individually get and set registers instead of querying for all the available registers. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:21 -05:00
Jason Wessel	534af10823	kgdb,kdb: individual register set and and get API The kdb shell specification includes the ability to get and set architecture specific registers by name. For the time being individual register get and set will be implemented on a per architecture basis. If an architecture defines DBG_MAX_REG_NUM > 0 then kdb and the gdbstub will use the capability for individually getting and setting architecture specific registers. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:20 -05:00
Jason Wessel	84a0bd5b28	gdbstub: Optimize kgdb's "thread:" response for the gdb serial protocol The gdb debugger understands how to parse short versions of the thread reference string as long as the bytes are paired in sets of two characters. The kgdb implementation was always sending 8 leading zeros which could be omitted, and further optimized in the case of non-negative thread numbers. The negative numbers are used to reference a specific cpu in the case of kgdb. An example of the previous i386 stop packet looks like: T05thread:00000000000003bb; New stop packet response: T05thread:03bb; The previous ThreadInfo response looks like: m00000000fffffffe,0000000000000001,0000000000000002,0000000000000003,0000000000000004,0000000000000005,0000000000000006,0000000000000007,000000000000000c,0000000000000088,000000000000008a,000000000000008b,000000000000008c,000000000000008d,000000000000008e,00000000000000d4,00000000000000d5,00000000000000dd New ThreadInfo response: mfffffffe,01,02,03,04,05,06,07,0c,88,8a,8b,8c,8d,8e,d4,d5,dd A few bytes saved means better response time when using kgdb over a serial line. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:19 -05:00
Andy Shevchenko	a9fa20a7af	kgdb: remove custom hex_to_bin()implementation Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:19 -05:00
Kevin Cernekee	034260d677	printk: fix delayed messages from CPU hotplug events When a secondary CPU is being brought up, it is not uncommon for printk() to be invoked when cpu_online(smp_processor_id()) == 0. The case that I witnessed personally was on MIPS: http://lkml.org/lkml/2010/5/30/4 If (can_use_console() == 0), printk() will spool its output to log_buf and it will be visible in "dmesg", but that output will NOT be echoed to the console until somebody calls release_console_sem() from a CPU that is online. Therefore, the boot time messages from the new CPU can get stuck in "limbo" for a long time, and might suddenly appear on the screen when a completely unrelated event (e.g. "eth0: link is down") occurs. This patch modifies the console code so that any pending messages are automatically flushed out to the console whenever a CPU hotplug operation completes successfully or aborts. The issue was seen on 2.6.34. Original patch by Kevin Cernekee with cleanups by akpm and additional fixes by Santosh Shilimkar. This patch superseeds https://patchwork.linux-mips.org/patch/1357/. Signed-off-by: Kevin Cernekee <cernekee@gmail.com> To: <mingo@elte.hu> To: <akpm@linux-foundation.org> To: <simon.kagstrom@netinsight.net> To: <David.Woodhouse@intel.com> To: <lethal@linux-sh.org> Cc: <linux-kernel@vger.kernel.org> Cc: <linux-mips@linux-mips.org> Reviewed-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Patchwork: https://patchwork.linux-mips.org/patch/1534/ LKML-Reference: <ede63b5a20af951c755736f035d1e787772d7c28@localhost> LKML-Reference: <EAF47CD23C76F840A9E7FCE10091EFAB02C5DB6D1F@dbde02.ent.ti.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2010-08-05 13:25:59 +01:00
Ingo Molnar	0bcfe75807	Merge branch 'sched/urgent' into sched/core Conflicts: include/linux/sched.h Merge reason: Add the leftover .35 urgent bits, fix the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-05 09:46:29 +02:00
Ingo Molnar	fc9ea5a1e5	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core	2010-08-05 08:46:15 +02:00
Ingo Molnar	61be7fdec2	Merge branch 'perf/nmi' into perf/core Conflicts: kernel/Makefile Merge reason: Add the now complete topic, fix the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-05 08:45:05 +02:00
Rusty Russell	51f3d0f474	module: cleanup comments, remove noinline On my (32-bit x86) machine, sys_init_module() uses 124 bytes of stack once load_module() is inlined. This effectively reverts `ffb4ba76` which inlined it due to stack pressure. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:13 +09:30
Rusty Russell	811d66a0e1	module: group post-relocation functions into post_relocation() This simply hoists more code out of load_module; we also put the identification of the extable and dynamic debug table in with the others in find_module_sections(). We move the taint check to the actual add/remove of the dynamic debug info: this is certain (find_module_sections is too early). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-08-05 12:59:13 +09:30
Rusty Russell	6526c534b2	module: move module args strndup_user to just before use Instead of copying and allocating the args and storing it in load_info, we can just allocate them right before we need them. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:12 +09:30
Rusty Russell	49668688dd	module: pass load_info into other functions Pass the struct load_info into all the other functions in module loading. This neatens things and makes them more consistent. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:10 +09:30
Rusty Russell	36b0360d17	module: fix sysfs cleanup for !CONFIG_SYSFS Restore the stub module_remove_modinfo_attrs, remove the now-unused !CONFIG_SYSFS module_sysfs_init. Also, rename mod_kobject_remove() to mod_sysfs_teardown() as it is the logical counterpart to mod_sysfs_setup now. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:10 +09:30
Rusty Russell	8f6d037815	module: sysfs cleanup We change the sysfs functions to take struct load_info, and call them all in mod_sysfs_setup(). We also clean up the #ifdefs a little. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:09 +09:30
Rusty Russell	d913188c75	module: layout_and_allocate layout_and_allocate() does everything up to and including the final struct module placement inside the allocated module memory. We have to store the symbol layout information in our struct load_info though. This avoids the nasty code we had before where 'mod' pointed first to the version inside the temporary allocation containing the entire file, then later was moved to point to the real struct module: now the main code only ever sees the final module address. (Includes fix for the Tony Luck-found Linus-diagnosed failure path error). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:09 +09:30
Rusty Russell	511ca6ae43	module: fix crash in get_ksymbol() when oopsing in module init Andrew had the sole pleasure of tickling this bug in linux-next; when we set up "info->strtab" it's pointing into the temporary copy of the module. For most uses that is fine, but kallsyms keeps a pointer around during module load (inside mod->strtab). If we oops for some reason inside a module's init function, kallsyms will use the mod->strtab pointer into the now-freed temporary module copy. (Later oopses work fine: after init we overwrite mod->strtab to point to a compacted core-only strtab). Reported-by: Andrew "Grumpy" Morton <akpm@linux-foundation.org> Signed-off-by: Rusty "Buggy" Russell <rusty@rustcorp.com.au> Tested-by: Andrew "Happy" Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:08 +09:30
Rusty Russell	eded41c1c6	module: kallsyms functions take struct load_info Simple refactor causes us to lift struct definition to top of file. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:08 +09:30
Rusty Russell	d6df72a06e	module: refactor out section header rewriting: FIX modversions We can't do the find_sec after removing the SHF_ALLOC flags; it won't find the sections. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:07 +09:30
Rusty Russell	8b5f61a795	module: refactor out section header rewriting Put all the "rewrite and check section headers" in one place. This adds another iteration over the sections, but it's far clearer. We iterate once for every find_section() so we already iterate over many times. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:06 +09:30
Linus Torvalds	3264d3f9dd	module: add load_info Btw, here's a patch that _looks_ large, but it really pretty trivial, and sets things up so that it would be way easier to split off pieces of the module loading. The reason it looks large is that it creates a "module_info" structure that contains all the module state that we're building up while loading, instead of having individual variables for all the indices etc. So the patch ends up being large, because every "symindex" access instead becomes "info.index.sym" etc. That may be a few characters longer, but it then means that we can just pass a pointer to that "info" structure around. and let all the pieces fill it in very naturally. As an example of that, the patch also moves the initialization of all those convenience variables into a "setup_module_info()" function. And at this point it really does become very natural to start to peel off some of the error labels and move them into the helper functions - now the "truncated" case is gone, and is handled inside that setup function instead. So maybe you don't like this approach, and it does make the variable accesses a bit longer, but I don't think unreadably so. And the patch really does look big and scary, but there really should be absolutely no semantic changes - most of it was a trivial and mindless rename. In fact, it was so mindless that I on purpose kept the existing helper functions looking like this: - err = check_modinfo(mod, sechdrs, infoindex, versindex); + err = check_modinfo(mod, info.sechdrs, info.index.info, info.index.vers); rather than changing them to just take the "info" pointer. IOW, a second phase (if you think the approach is ok) would change that calling convention to just do err = check_modinfo(mod, &info); (and same for "layout_sections()", "layout_symtabs()" etc.) Similarly, while right now it makes things _look_ bigger, with things like this: versindex = find_sec(hdr, sechdrs, secstrings, "__versions"); becoming info->index.vers = find_sec(info->hdr, info->sechdrs, info->secstrings, "__versions"); in the new "setup_module_info()" function, that's again just a result of it being a search-and-replace patch. By using the 'info' pointer, we could just change the 'find_sec()' interface so that it ends up being info->index.vers = find_sec(info, "__versions"); instead, and then we'd actually have a shorter and more readable line. So for a lot of those mindless variable name expansions there's would be room for separate cleanups. I didn't move quite everything in there - if we do this to layout_symtabs, for example, we'd want to move the percpu, symoffs, stroffs, *strmap variables to be fields in that module_info structure too. But that's a much smaller patch, I moved just the really core stuff that is currently being set up and used in various parts. But even in this rough form, it removes close to 70 lines from that function (but adds 22 lines overall, of course - the structure definition, the helper function declarations and call-sites etc etc). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:06 +09:30
Linus Torvalds	44032e6316	module: reduce stack usage for each_symbol() And now that I'm looking at that call-chain (to see if it would make sense to use some other more specific lock - doesn't look like it: all the readers are using RCU and this is the only writer), I also give you this trivial one-liner. It changes each_symbol() to not put that constant array on the stack, resulting in changing movq $C.388.31095, %rsi #, tmp85 subq $376, %rsp #, movq %rdi, %rbx # fn, fn leaq -208(%rbp), %rdi #, tmp84 movq %rbx, %rdx # fn, rep movsl xorl %esi, %esi # leaq -208(%rbp), %rdi #, tmp87 movq %r12, %rcx # data, call each_symbol_in_section.clone.0 # into xorl %esi, %esi # subq $216, %rsp #, movq %rdi, %rbx # fn, fn movq $arr.31078, %rdi #, call each_symbol_in_section.clone.0 # which is not so much about being obviously shorter and simpler because we don't unnecessarily copy that constant array around onto the stack, but also about having a much smaller stack footprint (376 vs 216 bytes - see the update of 'rsp'). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:06 +09:30
Rusty Russell	22e268ebec	module: refactor load_module part 5 1) Extract out the relocation loop into apply_relocations 2) Extract license and version checks into check_module_license_and_versions 3) Extract icache flushing into flush_module_icache 4) Move __obsparm warning into find_module_sections 5) Move license setting into check_modinfo. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:05 +09:30
Rusty Russell	9f85a4bbb1	module: refactor load_module part 4 Allocate references inside module_unload_init(), clean up inside module_unload_free(). This version fixed to do allocation before __this_cpu_write, thanks to bug reports from linux-next from Dave Young <hidave.darkstar@gmail.com> and Stephen Rothwell <sfr@canb.auug.org.au>. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:05 +09:30
Rusty Russell	40dd2560ec	module: refactor load_module part 3 Extract out the allocation and copying in from userspace, and the first set of modinfo checks. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:04 +09:30
Linus Torvalds	65b8a9b4d5	module: refactor load_module part 2 Here's a second one. It's slightly less trivial - since we now have error cases - and equally untested so it may well be totally broken. But it also cleans up a bit more, and avoids one of the goto targets, because the "move_module()" helper now does both allocations or none. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:03 +09:30
Linus Torvalds	f91a13bb99	module: refactor load_module I'd start from the trivial stuff. There's a fair amount of straight-line code that just makes the function hard to read just because you have to page up and down so far. Some of it is trivial to just create a helper function for. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:02 +09:30
Eric Dumazet	2409e74278	module: module_unload_init() cleanup No need to clear mod->refptr in module_unload_init(), since alloc_percpu() already clears allocated chunks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed unused var)	2010-08-05 12:59:02 +09:30
Linus Torvalds	3cfc2c42c1	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (48 commits) Documentation: update broken web addresses. fix comment typo "choosed" -> "chosen" hostap:hostap_hw.c Fix typo in comment Fix spelling contorller -> controller in comments Kconfig.debug: FAIL_IO_TIMEOUT: typo Faul -> Fault fs/Kconfig: Fix typo Userpace -> Userspace Removing dead MACH_U300_BS26 drivers/infiniband: Remove unnecessary casts of private_data fs/ocfs2: Remove unnecessary casts of private_data libfc: use ARRAY_SIZE scsi: bfa: use ARRAY_SIZE drm: i915: use ARRAY_SIZE drm: drm_edid: use ARRAY_SIZE synclink: use ARRAY_SIZE block: cciss: use ARRAY_SIZE comment typo fixes: charater => character fix comment typos concerning "challenge" arm: plat-spear: fix typo in kerneldoc reiserfs: typo comment fix update email address ...	2010-08-04 15:31:02 -07:00
Linus Torvalds	b7c8e55db7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (39 commits) random: Reorder struct entropy_store to remove padding on 64bits padata: update API documentation padata: Remove padata_get_cpumask crypto: pcrypt - Update pcrypt cpumask according to the padata cpumask notifier crypto: pcrypt - Rename pcrypt_instance padata: Pass the padata cpumasks to the cpumask_change_notifier chain padata: Rearrange set_cpumask functions padata: Rename padata_alloc functions crypto: pcrypt - Dont calulate a callback cpu on empty callback cpumask padata: Check for valid cpumasks padata: Allocate cpumask dependend recources in any case padata: Fix cpu index counting crypto: geode_aes - Convert pci_table entries to PCI_VDEVICE (if PCI_ANY_ID is used) pcrypt: Added sysfs interface to pcrypt padata: Added sysfs primitives to padata subsystem padata: Make two separate cpumasks padata: update documentation padata: simplify serialization mechanism padata: make padata_do_parallel to return zero on success padata: Handle empty padata cpumasks ...	2010-08-04 15:23:14 -07:00
Linus Torvalds	6ba74014c1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits) phy/marvell: add 88ec048 support igb: Program MDICNFG register prior to PHY init e1000e: correct MAC-PHY interconnect register offset for 82579 hso: Add new product ID can: Add driver for esd CAN-USB/2 device l2tp: fix export of header file for userspace can-raw: Fix skb_orphan_try handling Revert "net: remove zap_completion_queue" net: cleanup inclusion phy/marvell: add 88e1121 interface mode support u32: negative offset fix net: Fix a typo from "dev" to "ndev" igb: Use irq_synchronize per vector when using MSI-X ixgbevf: fix null pointer dereference due to filter being set for VLAN 0 e1000e: Fix irq_synchronize in MSI-X case e1000e: register pm_qos request on hardware activation ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice net: Add getsockopt support for TCP thin-streams cxgb4: update driver version cxgb4: add new PCI IDs ... Manually fix up conflicts in: - drivers/net/e1000e/netdev.c: due to pm_qos registration infrastructure changes - drivers/net/phy/marvell.c: conflict between adding 88ec048 support and cleaning up the IDs - drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req conflict (registration change vs marking it static)	2010-08-04 11:47:58 -07:00
David Howells	694f690d27	CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks Commit `8f92054e7c` ("CRED: Fix __task_cred()'s lockdep check and banner comment") fixed the lockdep checks on __task_cred(). This has shown up a place in the signalling code where a lock should be held - namely that check_kill_permission() requires its callers to hold the RCU lock. Fix group_send_sig_info() to get the RCU read lock around its call to check_kill_permission(). Without this patch, the following warning can occur: =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- kernel/signal.c:660 invoked rcu_dereference_check() without protection! ... Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-04 11:17:10 -07:00
Linus Torvalds	f46e9913fa	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Runtime: Add runtime PM statistics (v3) PM / Runtime: Make runtime_status attribute not debug-only (v. 2) PM: Do not use dynamically allocated objects in pm_wakeup_event() PM / Suspend: Fix ordering of calls in suspend error paths PM / Hibernate: Fix snapshot error code path PM / Hibernate: Fix hibernation_platform_enter() pm_qos: Get rid of the allocation in pm_qos_add_request() pm_qos: Reimplement using plists plist: Add plist_last PM: Make it possible to avoid races between wakeup and system sleep PNPACPI: Add support for remote wakeup PM: describe kernel policy regarding wakeup defaults (v. 2) PM / Hibernate: Fix typos in comments in kernel/power/swap.c	2010-08-04 11:14:36 -07:00
Srikar Dronamraju	9da79ab83e	tracing/kprobes: unregister_trace_probe needs to be called under mutex Comment in unregister_trace_probe() says probe_lock will be held when it gets called. However there is a case where it might called without the probe_lock being held. Also since we are traversing the probe_list and deleting an element from the probe_list, probe_lock should be held. This was first pointed in uprobes traceevent review by Frederic Weisbecker here. (http://lkml.org/lkml/2010/5/12/106) Cc: Ingo Molnar <mingo@elte.hu> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100630084548.GA10325@linux.vnet.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-08-04 12:41:23 -03:00
Jiri Kosina	d790d4d583	Merge branch 'master' into for-next	2010-08-04 15:14:38 +02:00
Patrick Pannuto	5e7f5a178b	timer: Added usleep_range timer usleep_range is a finer precision implementations of msleep and is designed to be a drop-in replacement for udelay where a precise sleep / busy-wait is unnecessary. Since an easy interface to hrtimers could lead to an undesired proliferation of interrupts, we provide only a "range" API, forcing the caller to think about an acceptable tolerance on both ends and hopefully avoiding introducing another interrupt. INTRO As discussed here ( http://lkml.org/lkml/2007/8/3/250 ), msleep(1) is not precise enough for many drivers (yes, sleep precision is an unfair notion, but consistently sleeping for ~an order of magnitude greater than requested is worth fixing). This patch adds a usleep API so that udelay does not have to be used. Obviously not every udelay can be replaced (those in atomic contexts or being used for simple bitbanging come to mind), but there are many, many examples of mydriver_write(...) /* Wait for hardware to latch / udelay(100) in various drivers where a busy-wait loop is neither beneficial nor necessary, but msleep simply does not provide enough precision and people are using a busy-wait loop instead. CONCERNS FROM THE RFC Why is udelay a problem / necessary? Most callers of udelay are in device/ driver initialization code, which is serial... As I see it, there is only benefit to sleeping over a delay; the notion of "refactoring" areas that use udelay was presented, but I see usleep as the refactoring. Consider i2c, if the bus is busy, you need to wait a bit (say 100us) before trying again, your current options are: udelay(100) * msleep(1) <-- As noted above, actually as high as ~20ms on some platforms, so not really an option * Manually set up an hrtimer to try again in 100us (which is what usleep does anyway...) People choose the udelay route because it is EASY; we need to provide a better easy route. Device / driver / boot code is currently serial, but every few months someone makes noise about parallelizing boot, and IMHO, a little forward-thinking now is one less thing to worry about if/when that ever happens udelay's could be preempted Sure, but if udelay plans on looping 1000 times, and it gets preempted on loop 200, whenever it's scheduled again, it is going to do the next 800 loops. Is the interruptible case needed? Probably not, but I see usleep as a very logical parallel to msleep, so it made sense to include the "full" API. Processors are getting faster (albeit not as quickly as they are becoming more parallel), so if someone wanted to be interruptible for a few usecs, why not let them? If this is a contentious point, I'm happy to remove it. OTHER THOUGHTS I believe there is also value in exposing the usleep_range option; it gives the scheduler a lot more flexibility and allows the programmer to express his intent much more clearly; it's something I would hope future driver writers will take advantage of. To get the results in the NUMBERS section below, I literally s/udelay/usleep the kernel tree; I had to go in and undo the changes to the USB drivers, but everything else booted successfully; I find that extremely telling in and of itself -- many people are using a delay API where a sleep will suit them just fine. SOME ATTEMPTS AT NUMBERS It turns out that calculating quantifiable benefit on this is challenging, so instead I will simply present the current state of things, and I hope this to be sufficient: How many udelay calls are there in 2.6.35-rc5? udealy(ARG) >= \| COUNT 1000 \| 319 500 \| 414 100 \| 1146 20 \| 1832 I am working on Android, so that is my focus for this. The following table is a modified usleep that simply printk's the amount of time requested to sleep; these tests were run on a kernel with udelay >= 20 --> usleep "boot" is power-on to lock screen "power collapse" is when the power button is pushed and the device suspends "resume" is when the power button is pushed and the lock screen is displayed (no touchscreen events or anything, just turning on the display) "use device" is from the unlock swipe to clicking around a bit; there is no sd card in this phone, so fail loading music, video, camera ACTION \| TOTAL NUMBER OF USLEEP CALLS \| NET TIME (us) boot \| 22 \| 1250 power-collapse \| 9 \| 1200 resume \| 5 \| 500 use device \| 59 \| 7700 The most interesting category to me is the "use device" field; 7700us of busy-wait time that could be put towards better responsiveness, or at the least less power usage. Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org> Cc: apw@canonical.com Cc: corbet@lwn.net Cc: arjan@linux.intel.com Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-08-04 11:00:45 +02:00
Thomas Gleixner	e1b004c3ef	Revert "timer: Added usleep[_range] timer" This reverts commit `22b8f15c2f` to merge an advanced version. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-08-04 10:53:00 +02:00
Benjamin Herrenschmidt	412a4ac5e9	Merge commit 'gcl/next' into next	2010-08-04 10:26:03 +10:00
Jesse Barnes	8cadd2831b	timer: add on-stack deferrable timer interfaces In some cases (for instance with kernel threads) it may be desireable to use on-stack deferrable timers to get their power saving benefits. Add interfaces to support this for the IPS driver. Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Matthew Garrett <mjg@redhat.com>	2010-08-03 09:48:45 -04:00
Arjan van de Ven	af5ab277de	clockevents: Remove the per cpu tick skew Historically, Linux has tried to make the regular timer tick on the various CPUs not happen at the same time, to avoid contention on xtime_lock. Nowadays, with the tickless kernel, this contention no longer happens since time keeping and updating are done differently. In addition, this skew is actually hurting power consumption in a measurable way on many-core systems. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20100727210210.58d3118c@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-08-02 21:45:58 +02:00
Ingo Molnar	3772b73472	Merge commit 'v2.6.35' into perf/core Conflicts: tools/perf/Makefile tools/perf/util/hist.c Merge reason: Resolve the conflicts and update to latest upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-02 08:31:54 +02:00
Frederic Weisbecker	669336e4cf	perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call We use synchronize_sched() to ensure a tracepoint won't be called while/after we release the perf buffers it references. But the tracepoint API has its own API for that: tracepoint_synchronize_unregister(). Use it instead as it's self-explanatory and eases maintainance. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-08-02 01:30:56 +02:00
Suresh Siddha	6ee0578b4d	workqueue: mark init_workqueues() as early_initcall() Mark init_workqueues() as early_initcall() and thus it will be initialized before smp bringup. init_workqueues() registers for the hotcpu notifier and thus it should cope with the processors that are brought online after the workqueues are initialized. x86 smp bringup code uses workqueues and uses a workaround for the cold boot process (as the workqueues are initialized post smp_init()). Marking init_workqueues() as early_initcall() will pave the way for cleaning up this code. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-08-01 13:05:29 +02:00
Tejun Heo	098849516d	workqueue: explain for_each_cwq_cpu() iterators for_each_cwq_cpu() are similar to regular CPU iterators except that it also considers the pseudo CPU number used for unbound workqueues. Explain them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-08-01 11:50:12 +02:00
Steffen Klassert	0500e9b3f1	padata: Remove padata_get_cpumask A function that copies the padata cpumasks to a user buffer is a bit error prone. The cpumask can change any time so we can't be sure to have the right cpumask when using this function. A user who is interested in the padata cpumasks should register to the padata cpumask notifier chain instead. Users of padata_get_cpumask are already updated, so we can remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:06 +08:00
Steffen Klassert	c635696c7c	padata: Pass the padata cpumasks to the cpumask_change_notifier chain We pass a pointer to the new padata cpumasks to the cpumask_change_notifier chain. So users can access the cpumasks without the need of an extra padata_get_cpumask function. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:05 +08:00
Steffen Klassert	65ff577e6b	padata: Rearrange set_cpumask functions padata_set_cpumask needs to be protected by a lock. We make __padata_set_cpumasks unlocked and static. So this function can be used by the exported and locked padata_set_cpumask and padata_set_cpumasks functions. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:04 +08:00
Steffen Klassert	e6cc117076	padata: Rename padata_alloc functions We rename padata_alloc to padata_alloc_possible because this function allocates a padata_instance and uses the cpu_possible mask for parallel and serial workers. Also we rename __padata_alloc to padata_alloc to avoid to export underlined functions. Underlined functions are considered to be private to padata. Users are updated accordingly. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:04 +08:00
David Howells	de09a9771a	CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a tasks credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-07-29 15:16:17 -07:00
Ian Campbell	685fd0b4ea	irq: Add new IRQ flag IRQF_NO_SUSPEND A small number of users of IRQF_TIMER are using it for the implied no suspend behaviour on interrupts which are not timer interrupts. Therefore add a new IRQF_NO_SUSPEND flag, rename IRQF_TIMER to __IRQF_TIMER and redefine IRQF_TIMER in terms of these new flags. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: xen-devel@lists.xensource.com Cc: linux-input@vger.kernel.org Cc: linuxppc-dev@ozlabs.org Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1280398595-29708-1-git-send-email-ian.campbell@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-29 13:24:57 +02:00
Thomas Gleixner	157b1a2385	kgdb: Do not access xtime directly The xtime cleanup missed the kgdb access to xtime. Fix it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-29 10:29:39 +02:00
Eric Paris	1968f5eed5	fanotify: use both marks when possible fanotify currently, when given a vfsmount_mark will look up (if it exists) the corresponding inode mark. This patch drops that lookup and uses the mark provided. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:55 -04:00
Eric Paris	ce8f76fb73	fsnotify: pass both the vfsmount mark and inode mark should_send_event() and handle_event() will both need to look up the inode event if they get a vfsmount event. Lets just pass both at the same time since we have them both after walking the lists in lockstep. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:54 -04:00
Eric Paris	43709a288e	fsnotify: remove group->mask group->mask is now useless. It was originally a shortcut for fsnotify to save on performance. These checks are now redundant, so we remove them. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:54 -04:00
Eric Paris	2612abb51b	fsnotify: cleanup should_send_event The change to use srcu and walk the object list rather than the global fsnotify_group list means that should_send_event is no longer needed for a number of groups and can be simplified for others. Do that. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:53 -04:00
Eric Paris	4cd76a4792	audit: use the mark in handler functions audit now gets a mark in the should_send_event and handle_event functions. Rather than look up the mark themselves audit should just use the mark it was handed. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:53 -04:00
Eric Paris	3a9b16b407	fsnotify: send fsnotify_mark to groups in event handling functions With the change of fsnotify to use srcu walking the marks list instead of walking the global groups list we now know the mark in question. The code can send the mark to the group's handling functions and the groups won't have to find those marks themselves. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:52 -04:00
Eric Paris	3bcf3860a4	fsnotify: store struct file not struct path Al explains that calling dentry_open() with a mnt/dentry pair is only garunteed to be safe if they are already used in an open struct file. To make sure this is the case don't store and use a struct path in fsnotify, always use a struct file. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:51 -04:00
Dave Young	d14f172948	sysctl extern cleanup: inotify Extern declarations in sysctl.c should be move to their own head file, and then include them in relavant .c files. Move inotify_table extern declaration to linux/inotify.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:59:01 -04:00
Alexey Dobriyan	6e006701cc	dnotify: move dir_notify_enable declaration Move dir_notify_enable declaration to where it belongs -- dnotify.h . Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:59:01 -04:00
Eric Paris	5444e2981c	fsnotify: split generic and inode specific mark code currently all marking is done by functions in inode-mark.c. Some of this is pretty generic and should be instead done in a generic function and we should only put the inode specific code in inode-mark.c Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:57 -04:00
Eric Paris	bbaa4168b2	fanotify: sys_fanotify_mark declartion This patch simply declares the new sys_fanotify_mark syscall int fanotify_mark(int fanotify_fd, unsigned int flags, u64_mask, int dfd const char *pathname) Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:55 -04:00
Eric Paris	11637e4b7d	fanotify: fanotify_init syscall declaration This patch defines a new syscall fanotify_init() of the form: int sys_fanotify_init(unsigned int flags, unsigned int event_f_flags, unsigned int priority) This syscall is used to create and fanotify group. This is very similar to the inotify_init() syscall. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:55 -04:00
Andreas Gruenbacher	3556608709	fsnotify: take inode->i_lock inside fsnotify_find_mark_entry() All callers to fsnotify_find_mark_entry() except one take and release inode->i_lock around the call. Take the lock inside fsnotify_find_mark_entry() instead. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:54 -04:00
Eric Paris	d07754412f	fsnotify: rename fsnotify_find_mark_entry to fsnotify_find_mark the _entry portion of fsnotify functions is useless. Drop it. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:53 -04:00
Eric Paris	e61ce86737	fsnotify: rename fsnotify_mark_entry to just fsnotify_mark The name is long and it serves no real purpose. So rename fsnotify_mark_entry to just fsnotify_mark. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:53 -04:00
Eric Paris	2823e04de4	fsnotify: put inode specific fields in an fsnotify_mark in a union The addition of marks on vfs mounts will be simplified if the inode specific parts of a mark and the vfsmnt specific parts of a mark are actually in a union so naming can be easy. This patch just implements the inode struct and the union. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:52 -04:00
Eric Paris	3a9fb89f4c	fsnotify: include vfsmount in should_send_event when appropriate To ensure that a group will not duplicate events when it receives it based on the vfsmount and the inode should_send_event test we should distinguish those two cases. We pass a vfsmount to this function so groups can make their own determinations. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:52 -04:00
Eric Paris	0d2e2a1d00	fsnotify: drop mask argument from fsnotify_alloc_group Nothing uses the mask argument to fsnotify_alloc_group. This patch drops that argument. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:51 -04:00
Eric Paris	220d14df0d	Audit: only set group mask when something is being watched Currently the audit watch group always sets a mask equal to all events it might care about. We instead should only set the group mask if we are actually watching inodes. This should be a perf win when audit watches are compiled in. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:51 -04:00
Eric Paris	ffab83402f	fsnotify: fsnotify_obtain_group should be fsnotify_alloc_group fsnotify_obtain_group was intended to be able to find an already existing group. Nothing uses that functionality. This just renames it to fsnotify_alloc_group so it is clear what it is doing. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:50 -04:00
Eric Paris	74be0cc828	fsnotify: remove group_num altogether The original fsnotify interface has a group-num which was intended to be able to find a group after it was added. I no longer think this is a necessary thing to do and so we remove the group_num. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:50 -04:00
Eric Paris	8112e2d6a7	fsnotify: include data in should_send calls fanotify is going to need to look at file->private_data to know if an event should be sent or not. This passes the data (which might be a file, dentry, inode, or none) to the should_send function calls so fanotify can get that information when available Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:31 -04:00
Eric Paris	7b0a04fbfb	fsnotify: provide the data type to should_send_event fanotify is only interested in event types which contain enough information to open the original file in the context of the fanotify listener. Since fanotify may not want to send events if that data isn't present we pass the data type to the should_send_event function call so fanotify can express its lack of interest. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:31 -04:00
Eric Paris	2dfc1cae4c	inotify: remove inotify in kernel interface nothing uses inotify in the kernel, drop it! Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:31 -04:00
Eric Paris	1a3aedbce4	Audit: audit watch init should not be before fsnotify init Audit watch init and fsnotify init both use subsys_initcall() but since the audit watch code is linked in before the fsnotify code the audit watch code would be using the fsnotify srcu struct before it was initialized. This patch fixes that problem by moving audit watch init to device_initcall() so it happens after fsnotify is ready. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Eric Paris <eparis@redhat.com> Tested-by : Sachin Sant <sachinp@in.ibm.com>	2010-07-28 09:58:19 -04:00
Eric Paris	939a67fc4c	Audit: split audit watch Kconfig Audit watch should depend on CONFIG_AUDIT_SYSCALL and should select FSNOTIFY. This splits the spagetti like mixing of audit_watch and audit_filter code so they can be configured seperately. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:19 -04:00
Eric Paris	28a3a7eb3b	audit: reimplement audit_trees using fsnotify rather than inotify Simply switch audit_trees from using inotify to using fsnotify for it's inode pinning and disappearing act information. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:17 -04:00
Eric Paris	40554c3dae	fsnotify: allow addition of duplicate fsnotify marks This patch allows a task to add a second fsnotify mark to an inode for the same group. This mark will be added to the end of the inode's list and this will never be found by the stand fsnotify_find_mark() function. This is useful if a user wants to add a new mark before removing the old one. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:17 -04:00
Eric Paris	a05fb6cc57	audit: do not get and put just to free a watch deleting audit watch rules is not currently done under audit_filter_mutex. It was done this way because we could not hold the mutex during inotify manipulation. Since we are using fsnotify we don't need to do the extra get/put pair nor do we need the private list on which to store the parents while they are about to be freed. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:17 -04:00
Eric Paris	e118e9c563	audit: redo audit watch locking and refcnt in light of fsnotify fsnotify can handle mutexes to be held across all fsnotify operations since it deals strickly in spinlocks. This can simplify and reduce some of the audit_filter_mutex taking and dropping. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:16 -04:00
Eric Paris	e9fd702a58	audit: convert audit watches to use fsnotify instead of inotify Audit currently uses inotify to pin inodes in core and to detect when watched inodes are deleted or unmounted. This patch uses fsnotify instead of inotify. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:16 -04:00
Eric Paris	ae7b8f4108	Audit: clean up the audit_watch split No real changes, just cleanup to the audit_watch split patch which we done with minimal code changes for easy review. Now fix interfaces to make things work better. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:16 -04:00
Sridhar Samudrala	d7926ee38f	cgroups: Add an API to attach a task to current task's cgroup Add a new kernel API to attach a task to current task's cgroup in all the active hierarchies. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com>	2010-07-28 15:45:12 +03:00
Jason Baron	b82bab4bbe	dynamic debug: move ddebug_remove_module() down into free_module() The command echo "file ec.c +p" >/sys/kernel/debug/dynamic_debug/control causes an oops. Move the call to ddebug_remove_module() down into free_module(). In this way it should be called from all error paths. Currently, we are missing the remove if the module init routine fails. Signed-off-by: Jason Baron <jbaron@redhat.com> Reported-by: Thomas Renninger <trenn@suse.de> Tested-by: Thomas Renninger <trenn@suse.de> Cc: <stable@kernel.org> [2.6.32+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-07-27 14:32:06 -07:00
John Stultz	852db46d55	clocksource: Add __clocksource_updatefreq_hz/khz methods To properly handle clocksources that change frequencies at the clocksource->enable() point, this patch adds a method that will update the clocksource's mult/shift and max_idle_ns values. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-12-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:55 +02:00
John Stultz	0fb86b0629	timekeeping: Make xtime and wall_to_monotonic static This patch makes xtime and wall_to_monotonic static, as planned in Documentation/feature-removal-schedule.txt. This will allow for further cleanups to the timekeeping core. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-10-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:55 +02:00
John Stultz	8ab4351a4c	hrtimer: Cleanup direct access to wall_to_monotonic Provides an accessor function to replace hrtimer.c's direct access of wall_to_monotonic. This will allow wall_to_monotonic to be made static as planned in Documentation/feature-removal-schedule.txt Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-9-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:55 +02:00
John Stultz	7615856ebf	timkeeping: Fix update_vsyscall to provide wall_to_monotonic offset update_vsyscall() did not provide the wall_to_monotoinc offset, so arch specific implementations tend to reference wall_to_monotonic directly. This limits future cleanups in the timekeeping core, so this patch fixes the update_vsyscall interface to provide wall_to_monotonic, allowing wall_to_monotonic to be made static as planned in Documentation/feature-removal-schedule.txt Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Anton Blanchard <anton@samba.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Tony Luck <tony.luck@intel.com> LKML-Reference: <1279068988-21864-7-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:54 +02:00
John Stultz	592913ecb8	time: Kill off CONFIG_GENERIC_TIME Now that all arches have been converted over to use generic time via clocksources or arch_gettimeoffset(), we can remove the GENERIC_TIME config option and simplify the generic code. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-4-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:54 +02:00
John Stultz	ce3bf7ab22	time: Implement timespec_add After accidentally misusing timespec_add_safe, I wanted to make sure we don't accidently trip over that issue again, so I created a simple timespec_add() function which we can use to replace the instances of timespec_add_safe() that don't want the overflow detection. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-3-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:53 +02:00
Steffen Klassert	7424713b83	padata: Check for valid cpumasks Now that we allow to change the cpumasks from userspace, we have to check for valid cpumasks in padata_do_parallel. This patch adds the necessary check. This fixes a division by zero crash if the parallel cpumask contains no active cpu. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-26 14:13:58 +08:00
Steffen Klassert	b89661dff5	padata: Allocate cpumask dependend recources in any case The cpumask separation work assumes the cpumask dependend recources present regardless of valid or invalid cpumasks. With this patch we allocate the cpumask dependend recources in any case. This fixes two NULL pointer dereference crashes in padata_replace and in padata_get_cpumask. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-26 14:13:58 +08:00
Steffen Klassert	fad3a906d3	padata: Fix cpu index counting The counting of the cpu index got lost with a recent commit. This patch restores it. This fixes a hang of the parallel worker threads on cpu hotplug. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-26 14:13:57 +08:00
Andrey Vagin	2b08de0073	posix_timer: Move copy_to_user(created_timer_id) down in timer_create() According to Oleg Nesterov: We can move copy_to_user(created_timer_id) down after "if (timer_event_spec)" block too. (but before CLOCK_DISPATCH(), of course). Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Andrey Vagin <avagin@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-23 15:08:12 +02:00
Patrick Pannuto	22b8f15c2f	timer: Added usleep[_range] timer usleep[_range] are finer precision implementations of msleep and are designed to be drop-in replacements for udelay where a precise sleep / busy-wait is unnecessary. They also allow an easy interface to specify slack when a precise (ish) wakeup is unnecessary to help minimize wakeups Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org> Cc: akinobu.mita@gmail.com Cc: sboyd@codeaurora.org Acked-by: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <4C44CDD2.1070708@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-23 15:08:12 +02:00
J. Bruce Fields	866e26115c	timers: Document meaning of deferrable timer Steal some text from `6e453a6751` "Add support for deferrable timers". A reader shouldn't have to dig through the git logs for the basic description of a deferrable timer. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: johnstul@us.ibm.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-23 15:08:12 +02:00
Tejun Heo	181a51f6e0	slow-work: kill it slow-work doesn't have any user left. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Howells <dhowells@redhat.com>	2010-07-23 13:14:49 +02:00
Ingo Molnar	3a01736e70	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-07-23 09:10:29 +02:00
Tejun Heo	e120153ddf	workqueue: fix how cpu number is stored in work->data Once a work starts execution, its data contains the cpu number it was on instead of pointing to cwq. This is added by commit `7a22ad75` (workqueue: carry cpu number in work data once execution starts) to reliably determine the work was last on even if the workqueue itself was destroyed inbetween. Whether data points to a cwq or contains a cpu number was distinguished by comparing the value against PAGE_OFFSET. The assumption was that a cpu number should be below PAGE_OFFSET while a pointer to cwq should be above it. However, on architectures which use separate address spaces for user and kernel spaces, this doesn't hold as PAGE_OFFSET is zero. Fix it by using an explicit flag, WORK_STRUCT_CWQ, to mark what the data field contains. If the flag is set, it's pointing to a cwq; otherwise, it contains a cpu number. Reported on s390 and microblaze during linux-next testing. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sachin Sant <sachinp@in.ibm.com> Reported-by: Michal Simek <michal.simek@petalogix.com> Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Michal Simek <monstr@monstr.eu>	2010-07-22 22:39:22 +02:00
Dan Carpenter	24a461d537	trace: strlen() return doesn't account for the NULL We need to add one to the strlen() return because of the NULL character. The type->name here generally comes from the kernel and I don't think any of them come close to being MAX_TRACER_SIZE (100) characters long so this is basically a cleanup. Signed-off-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <20100710100644.GV19184@bicker> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-22 14:56:41 -04:00
Jason Wessel	edd63cb6b9	sysrq,kdb: Use __handle_sysrq() for kdb's sysrq function The kdb code should not toggle the sysrq state in case an end user wants to try and resume the normal kernel execution. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>	2010-07-21 19:27:07 -05:00
Jason Wessel	b0679c63db	debug_core,kdb: fix kgdb_connected bit set in the wrong place Immediately following an exit from the kdb shell the kgdb_connected variable should be set to zero, unless there are breakpoints planted. If the kgdb_connected variable is not zeroed out with kdb, it is impossible to turn off kdb. This patch is merely a work around for now, the real fix will check for the breakpoints. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:07 -05:00
Jason Wessel	9e8b624fca	Fix merge regression from external kdb to upstream kdb In the process of merging kdb to the mainline, the kdb lsmod command stopped printing the base load address of kernel modules. This is needed for using kdb in conjunction with external tools such as gdb. Simply restore the functionality by adding a kdb_printf for the base load address of the kernel modules. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:06 -05:00
Jason Wessel	fb82c0ff27	repair gdbstub to match the gdbserial protocol specification The gdbserial protocol handler should return an empty packet instead of an error string when ever it responds to a command it does not implement. The problem cases come from a debugger client sending qTBuffer, qTStatus, qSearch, qSupported. The incorrect response from the gdbstub leads the debugger clients to not function correctly. Recent versions of gdb will not detach correctly as a result of this behavior. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>	2010-07-21 19:27:05 -05:00
Martin Hicks	1396a21ba0	kdb: break out of kdb_ll() when command is terminated Without this patch the "ll" linked-list traversal command won't terminate when you hit q/Q. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:05 -05:00
Josh Hunt	eebef74695	sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug The sched_child_runs_first value in /proc/sched_debug is currently displayed using a macro meant to split ns time values. This patch uses the correct macro to display it as a plain decimal value. Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: peterz@infradead.org Cc: juhlenko@akamai.com Cc: efault@gmx.de LKML-Reference: <1279567876-25418-1-git-send-email-johunt@akamai.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:46:12 +02:00
Ingo Molnar	dca45ad8af	Merge branch 'linus' into sched/core Merge reason: Move from the -rc3 to the almost-rc6 base. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:45:08 +02:00
Ingo Molnar	23c2875725	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-07-21 21:44:18 +02:00
Ingo Molnar	9dcdbf7a33	Merge branch 'linus' into perf/core Merge reason: Pick up the latest perf fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:43:06 +02:00
KOSAKI Motohiro	ef710e100c	tracing: Shrink max latency ringbuffer if unnecessary Documentation/trace/ftrace.txt says buffer_size_kb: This sets or displays the number of kilobytes each CPU buffer can hold. The tracer buffers are the same size for each CPU. The displayed number is the size of the CPU buffer and not total size of all buffers. The trace buffers are allocated in pages (blocks of memory that the kernel uses for allocation, usually 4 KB in size). If the last page allocated has room for more bytes than requested, the rest of the page will be used, making the actual allocation bigger than requested. ( Note, the size may not be a multiple of the page size due to buffer management overhead. ) This can only be updated when the current_tracer is set to "nop". But it's incorrect. currently total memory consumption is 'buffer_size_kb x CPUs x 2'. Why two times difference is there? because ftrace implicitly allocate the buffer for max latency too. That makes sad result when admin want to use large buffer. (If admin want full logging and makes detail analysis). example, If admin have 24 CPUs machine and write 200MB to buffer_size_kb, the system consume ~10GB memory (200MB x 24 x 2). umm.. 5GB memory waste is usually unacceptable. Fortunatelly, almost all users don't use max latency feature. The max latency buffer can be disabled easily. This patch shrink buffer size of the max latency buffer if unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-21 10:20:17 -04:00
Lai Jiangshan	bc289ae98b	tracing: Reduce latency and remove percpu trace_seq __print_flags() and __print_symbolic() use percpu trace_seq: 1) Its memory is allocated at compile time, it wastes memory if we don't use tracing. 2) It is percpu data and it wastes more memory for multi-cpus system. 3) It disables preemption when it executes its core routine "trace_seq_printf(s, "%s: ", #call);" and introduces latency. So we move this trace_seq to struct trace_iterator. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4C078350.7090106@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 22:05:34 -04:00
Richard Kennedy	985023dee6	trace: Reorder struct ring_buffer_per_cpu to remove padding on 64bit Reorder structure to remove 8 bytes of padding on 64 bit builds. This shrinks the size to 128 bytes so allowing allocation from a smaller slab & needed one fewer cache lines. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> LKML-Reference: <1269516456.2054.8.camel@localhost> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 21:58:44 -04:00
Li Zefan	e870e9a124	tracing: Allow to disable cmdline recording We found that even enabling a single trace event that will rarely be triggered can add big overhead to context switch. (lmbench context switch test) ------------------------------------------------- 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ------ ------ ------ ------ ------ ------- ------- 2.19 2.3 2.21 2.56 2.13 2.54 2.07 2.39 2.51 2.35 2.75 2.27 2.81 2.24 The overhead is 6% ~ 11%. It's because when a trace event is enabled 3 tracepoints (sched_switch, sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname. We'd like to avoid this overhead, so add a trace option '(no)record-cmd' to allow to disable cmdline recording. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 21:52:33 -04:00
Neil Horman	70d4bf6d46	drop_monitor: convert some kfree_skb call sites to consume_skb Convert a few calls from kfree_skb to consume_skb Noticed while I was working on dropwatch that I was detecting lots of internal skb drops in several places. While some are legitimate, several were not, freeing skbs that were at the end of their life, rather than being discarded due to an error. This patch converts those calls sites from using kfree_skb to consume_skb, which quiets the in-kernel drop_monitor code from detecting them as drops. Tested successfully by myself Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-20 13:28:05 -07:00
Tejun Heo	f2e005aaff	workqueue: fix mayday_mask handling on UP All cpumasks are assumed to have cpu 0 permanently set on UP, so it can't be used to signify whether there's something to be done for the CPU. workqueue was using cpumask to track which CPU requested rescuer assistance and this led rescuer thread to think there always are pending mayday requests on UP, which resulted in infinite busy loops. This patch fixes the problem by introducing mayday_mask_t and associated helpers which wrap cpumask on SMP and emulates its behavior using bitops and unsigned long on UP. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au>	2010-07-20 15:59:09 +02:00
Arnd Bergmann	b444786f1a	tracing: Use generic_file_llseek for debugfs The default for llseek will change to no_llseek, so the tracing debugfs files need to add explicit .llseek assignments. Since we're dealing with regular files from a VFS perspective, use generic_file_llseek. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1278538820-1392-10-git-send-email-arnd@arndb.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-07-20 14:31:24 +02:00
Frederic Weisbecker	eb7beb5c09	tracing: Remove special traces Special traces type was only used by sysprof. Lets remove it now that sysprof ftrace plugin has been dropped. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Soeren Sandmann <sandmann@daimi.au.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-07-20 14:31:07 +02:00
Frederic Weisbecker	f376bf5ffb	tracing: Remove sysprof ftrace plugin The sysprof ftrace plugin doesn't seem to be seriously used somewhere. There is a branch in the sysprof tree that makes an interface to it, but the real sysprof tool uses either its own module or perf events. Drop the sysprof ftrace plugin then, as it's mostly useless. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Soeren Sandmann <sandmann@daimi.au.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-07-20 14:29:46 +02:00
Tejun Heo	931ac77ef6	workqueue: fix build problem on !CONFIG_SMP Commit `f3421797` (workqueue: implement unbound workqueue) incorrectly tested CONFIG_SMP as part of a C expression in alloc/free_cwqs(). As CONFIG_SMP is not defined in UP, this breaks build. Fix it by using Found during linux-next build test. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>	2010-07-20 11:15:14 +02:00
Catalin Marinas	9078370c0d	kmemleak: Add support for NO_BOOTMEM configurations With commits `08677214` and `59be5a8e`, alloc_bootmem()/free_bootmem() and friends use the early_res functions for memory management when NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the corresponding code paths for bootmem allocations. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: stable@kernel.org	2010-07-19 11:54:15 +01:00
Pavel Machek	a2531293db	update email address pavel@suse.cz no longer works, replace it with working address. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-07-19 10:56:54 +02:00
Dan Kruchinin	5e017dc3f8	padata: Added sysfs primitives to padata subsystem Added sysfs primitives to padata subsystem. Now API user may embedded kobject each padata instance contains into any sysfs hierarchy. For now padata sysfs interface provides only two objects: serial_cpumask [RW] - cpumask for serial workers parallel_cpumask [RW] - cpumask for parallel workers Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-19 13:50:19 +08:00
Dan Kruchinin	e15bacbebb	padata: Make two separate cpumasks The aim of this patch is to make two separate cpumasks for padata parallel and serial workers respectively. It allows user to make more thin and sophisticated configurations of padata framework. For example user may bind parallel and serial workers to non-intersecting CPU groups to gain better performance. Also each padata instance has notifiers chain for its cpumasks now. If either parallel or serial or both masks were changed all interested subsystems will get notification about that. It's especially useful if padata user uses algorithm for callback CPU selection according to serial cpumask. Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-19 13:50:19 +08:00
Rafael J. Wysocki	ce4410116c	PM / Suspend: Fix ordering of calls in suspend error paths The ACPI suspend code calls suspend_nvs_free() at a wrong place, which may lead to a memory leak if there's an error executing acpi_pm_prepare(), because acpi_pm_finish() will not be called in that case. However, the root cause of this problem is the apparently confusing ordering of calls in suspend error paths that needs to be fixed. In addition to that, fix a typo in a label name in suspend.c. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
Rafael J. Wysocki	d074ee023f	PM / Hibernate: Fix snapshot error code path There is an inconsistency between hibernation_platform_enter() and hibernation_snapshot(), because the latter calls hibernation_ops->end() after failing hibernation_ops->begin(), while the former doesn't do that. Make hibernation_snapshot() behave in the same way as hibernation_platform_enter() in that respect. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
Rafael J. Wysocki	f6f71f1875	PM / Hibernate: Fix hibernation_platform_enter() The hibernation_platform_enter() function calls dpm_suspend_noirq() instead of dpm_resume_noirq() by mistake. Fix this. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
James Bottomley	82f682514a	pm_qos: Get rid of the allocation in pm_qos_add_request() All current users of pm_qos_add_request() have the ability to supply the memory required by the pm_qos routines, so make them do this and eliminate the kmalloc() with pm_qos_add_request(). This has the double benefit of making the call never fail and allowing it to be called from atomic context. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 02:00:34 +02:00
James Bottomley	5f279845f9	pm_qos: Reimplement using plists A lot of the pm_qos extremal value handling is really duplicating what a priority ordered list does, just in a less efficient fashion. Simply redoing the implementation in terms of a plist gets rid of a lot of this junk (although there are several other strange things that could do with tidying up, like pm_qos_request_list has to carry the pm_qos_class with every node, simply because it doesn't get passed in to pm_qos_update_request even though every caller knows full well what parameter it's updating). I think this redo is a win independent of android, so we should do something like this now. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 02:00:18 +02:00
Rafael J. Wysocki	c125e96f04	PM: Make it possible to avoid races between wakeup and system sleep One of the arguments during the suspend blockers discussion was that the mainline kernel didn't contain any mechanisms making it possible to avoid races between wakeup and system suspend. Generally, there are two problems in that area. First, if a wakeup event occurs exactly when /sys/power/state is being written to, it may be delivered to user space right before the freezer kicks in, so the user space consumer of the event may not be able to process it before the system is suspended. Second, if a wakeup event occurs after user space has been frozen, it is not generally guaranteed that the ongoing transition of the system into a sleep state will be aborted. To address these issues introduce a new global sysfs attribute, /sys/power/wakeup_count, associated with a running counter of wakeup events and three helper functions, pm_stay_awake(), pm_relax(), and pm_wakeup_event(), that may be used by kernel subsystems to control the behavior of this attribute and to request the PM core to abort system transitions into a sleep state already in progress. The /sys/power/wakeup_count file may be read from or written to by user space. Reads will always succeed (unless interrupted by a signal) and return the current value of the wakeup events counter. Writes, however, will only succeed if the written number is equal to the current value of the wakeup events counter. If a write is successful, it will cause the kernel to save the current value of the wakeup events counter and to abort the subsequent system transition into a sleep state if any wakeup events are reported after the write has returned. [The assumption is that before writing to /sys/power/state user space will first read from /sys/power/wakeup_count. Next, user space consumers of wakeup events will have a chance to acknowledge or veto the upcoming system transition to a sleep state. Finally, if the transition is allowed to proceed, /sys/power/wakeup_count will be written to and if that succeeds, /sys/power/state will be written to as well. Still, if any wakeup events are reported to the PM core by kernel subsystems after that point, the transition will be aborted.] Additionally, put a wakeup events counter into struct dev_pm_info and make these per-device wakeup event counters available via sysfs, so that it's possible to check the activity of various wakeup event sources within the kernel. To illustrate how subsystems can use pm_wakeup_event(), make the low-level PCI runtime PM wakeup-handling code use it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>	2010-07-19 01:58:48 +02:00
Cesar Eduardo Barros	9013367339	PM / Hibernate: Fix typos in comments in kernel/power/swap.c There are a few typos in kernel/power/swap.c. Fix them. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 01:58:47 +02:00
Pekka Enberg	68c38fc3cb	sched: No need for bootmem special cases As of commit `dcce284` ("mm: Extend gfp masking to the page allocator") and commit `7e85ee0` ("slab,slub: don't enable interrupts during early boot"), the slab allocator makes sure we don't attempt to sleep during boot. Therefore, remove bootmem special cases from the scheduler and use plain GFP_KERNEL instead. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:06:22 +02:00
Peter Zijlstra	396e894d28	sched: Revert nohz_ratelimit() for now Norbert reported that nohz_ratelimit() causes his laptop to burn about 4W (40%) extra. For now back out the change and see if we can adjust the power management code to make better decisions. Reported-by: Norbert Preining <preining@logic.at> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:05:44 +02:00
Peter Zijlstra	bbc8cb5bae	sched: Reduce update_group_power() calls Currently we update cpu_power() too often, update_group_power() only updates the local group's cpu_power but it gets called for all groups. Furthermore, CPU_NEWLY_IDLE invocations will result in all cpus calling it, even though a slow update of cpu_power is sufficient. Therefore move the update under 'idle != CPU_NEWLY_IDLE && local_group' to reduce superfluous invocations. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <1278612989.1900.176.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:05:14 +02:00
Suresh Siddha	5343bdb8fd	sched: Update rq->clock for nohz balanced cpus Suresh spotted that we don't update the rq->clock in the nohz load-balancer path. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1278626014.2834.74.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:02:08 +02:00
Jiri Slaby	c022a0acad	rlimits: implement prlimit64 syscall This patch adds the code to support the sys_prlimit64 syscall which modifies-and-returns the rlim values of a selected process atomically. The first parameter, pid, being 0 means current process. Unlike the current implementation, it is a generic interface, architecture indepentent so that we needn't handle compat stuff anymore. In the future, after glibc start to use this we can deprecate sys_setrlimit and sys_getrlimit in favor to clean up the code finally. It also adds a possibility of changing limits of other processes. We check the user's permissions to do that and if it succeeds, the new limits are propagated online. This is good for large scale applications such as SAP or databases where administrators need to change limits time by time (e.g. on crashes increase core size). And it is unacceptable to restart the service. For safety, all rlim users now either use accessors or doesn't need them due to - locking - the fact a process was just forked and nobody else knows about it yet (and nobody can't thus read/write limits) hence it is safe to modify limits now. The limitation is that we currently stay at ulong internal representation. So the rlim64_is_infinity check is used where value is compared against ULONG_MAX on 32-bit which is the maximum value there. And since internally the limits are held in struct rlimit, converters which are used before and after do_prlimit call in sys_prlimit64 are introduced. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	b95183453a	rlimits: switch more rlimit syscalls to do_prlimit After we added more generic do_prlimit, switch sys_getrlimit to that. Also switch compat handling, so we can get rid of ugly __user casts and avoid setting process' address limit to kernel data and back. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	5b41535aac	rlimits: redo do_setrlimit to more generic do_prlimit It now allows also reading of limits. I.e. all read and writes will later use this function. It takes two parameters, new and old limits which can be both NULL. If new is non-NULL, the value in it is set to rlimits. If old is non-NULL, current rlimits are stored there. If both are non-NULL, old are stored prior to setting the new ones, atomically. (Similar to sigaction.) Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	86f162f4c7	rlimits: do security check under task_lock Do security_task_setrlimit under task_lock. Other tasks may change limits under our hands while we are checking limits inside the function. From now on, they can't. Note that all the security work is done under a spinlock here now. Security hooks count with that, they are called from interrupt context (like security_task_kill) and with spinlocks already held (e.g. capable->security_capable). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: James Morris <jmorris@namei.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>	2010-07-16 09:48:47 +02:00
Jiri Slaby	1c1e618ddd	rlimits: allow setrlimit to non-current tasks Add locking to allow setrlimit accept task parameter other than current. Namely, lock tasklist_lock for read and check whether the task structure has sighand non-null. Do all the signal processing under that lock still held. There are some points: 1) security_task_setrlimit is now called with that lock held. This is not new, many security_* functions are called with this lock held already so it doesn't harm (all this security_* stuff does almost the same). 2) task->sighand->siglock (in update_rlimit_cpu) is nested in tasklist_lock. This dependence is already existing. 3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already existing dependence. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com>	2010-07-16 09:48:47 +02:00
Jiri Slaby	7855c35da7	rlimits: split sys_setrlimit Create do_setrlimit from sys_setrlimit and declare do_setrlimit in the resource header. This is the first phase to have generic do_prlimit which allows to be called from read, write and compat rlimits code. The new do_setrlimit also accepts a task pointer to change the limits of. Currently, it cannot be other than current, but this will change with locking later. Also pass tsk->group_leader to security_task_setrlimit to check whether current is allowed to change rlimits of the process and not its arbitrary thread because it makes more sense given that rlimit are per process and not per-thread. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:46 +02:00
Oleg Nesterov	2fb9d2689a	rlimits: make sure ->rlim_max never grows in sys_setrlimit Mostly preparation for Jiri's changes, but probably makes sense anyway. sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when it takes task_lock() old_rlim->rlim_max can be already lowered. Move this check under task_lock(). Currently this is not important, we can only race with our sub-thread, this means the application is stupid. But when we change the code to allow the update of !current task's limits, it becomes important to make sure ->rlim_max can be lowered "reliably" even if we race with the application doing sys_setrlimit(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:46 +02:00
Jiri Slaby	5ab46b345e	rlimits: add task_struct to update_rlimit_cpu Add task_struct as a parameter to update_rlimit_cpu to be able to set rlimit_cpu of different task than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: James Morris <jmorris@namei.org>	2010-07-16 09:48:45 +02:00
Jiri Slaby	8fd00b4d70	rlimits: security, add task_struct to setrlimit Add task_struct to task_setrlimit of security_operations to be able to set rlimit of task other than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> Acked-by: James Morris <jmorris@namei.org>	2010-07-16 09:48:45 +02:00
Frederic Weisbecker	5d550467b9	tracing: Remove ksym tracer The ksym (breakpoint) ftrace plugin has been superseded by perf tools that are much more poweful to use the cpu breakpoints. This tracer doesn't bring more feature. It has been deprecated for a while now, lets remove it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu>	2010-07-15 23:59:33 +02:00
Steffen Klassert	5f1a8c1bc7	padata: simplify serialization mechanism We count the number of processed objects on a percpu basis, so we need to go through all the percpu reorder queues to calculate the sequence number of the next object that needs serialization. This patch changes this to count the number of processed objects global. So we can calculate the sequence number and the percpu reorder queue of the next object that needs serialization without searching through the percpu reorder queues. This avoids some accesses to memory of foreign cpus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:30 +08:00
Steffen Klassert	83f619f3c8	padata: make padata_do_parallel to return zero on success To return -EINPROGRESS on success in padata_do_parallel was considered to be odd. This patch changes this to return zero on success. Also the only user of padata, pcrypt is adapted to convert a return of zero to -EINPROGRESS within the crypto layer. This also removes the pcrypt fallback if padata_do_parallel was called on a not running padata instance as we can't handle it anymore. This fallback was unused, so it's save to remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	33e5445068	padata: Handle empty padata cpumasks This patch fixes a bug when the padata cpumask does not intersect with the active cpumask. In this case we get a division by zero in padata_alloc_pd and we end up with a useless padata instance. Padata can end up with an empty cpumask for two reasons: 1. A user removed the last cpu that belongs to the padata cpumask and the active cpumask. 2. The last cpu that belongs to the padata cpumask and the active cpumask goes offline. We introduce a function padata_validate_cpumask to check if the padata cpumask does intersect with the active cpumask. If the cpumasks do not intersect we mark the instance as invalid, so it can't be used. We do not allocate the cpumask dependend recources in this case. This fixes the division by zero and keeps the padate instance in a consistent state. It's not possible to trigger this bug by now because the only padata user, pcrypt uses always the possible cpumask. Reported-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	ee83655512	padata: Block until the instance is unused on stop This patch makes padata_stop to block until the padata instance is unused. Also we split padata_stop to a locked and a unlocked version. This is in preparation to be able to change the cpumask after a call to patata stop. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	4c87917029	padata: Check for valid padata instance on start This patch introduces the PADATA_INVALID flag which is checked on padata start. This will be used to mark a padata instance as invalid, if the padata cpumask does not intersect with the active cpumask. we change padata_start to return an error if the PADATA_INVALID is set. Also we adapt the only padata user, pcrypt to this change. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:28 +08:00
Tejun Heo	9f9c23644b	workqueue: fix locking in retry path of maybe_create_worker() maybe_create_worker() mismanaged locking when worker creation fails and it has to retry. Fix locking and simplify lock manipulation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yong Zhang <yong.zhang@windriver.com>	2010-07-14 11:31:20 +02:00
Tejun Heo	083b804c4d	async: use workqueue for worker pool Replace private worker pool with system_unbound_wq. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Arjan van de Ven <arjan@infradead.org>	2010-07-14 11:29:46 +02:00
Uwe Kleine-König	698f93159a	fix comment/printk typos concerning "already" Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-07-11 21:45:40 +02:00
Arnd Bergmann	5e3d20a68f	init: Remove the BKL from startup code I have shown by code review that no driver takes the BKL at init time any more, so whatever the init code was locking against is no longer there and it is now safe to remove the BKL there. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt <rostedt@goodmis> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-07-09 15:40:32 +02:00
Benjamin Herrenschmidt	5f07aa7524	Merge commit 'paulus-perf/master' into next	2010-07-09 11:25:48 +10:00
Kulikov Vasiliy	eb703f9819	kernel/watchdog: Initialize 'result' Variable on the stack is not initialized to zero, do it explicitly. This bug was found by a compiler warning: kernel/watchdog.c:463: warning: 'result' may be used uninitialized in this function Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1278316854-28442-1-git-send-email-segooon@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-07 08:46:42 +02:00
Ingo Molnar	8bd0e1be25	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core	2010-07-06 09:00:32 +02:00
Masami Hiramatsu	e09c8614b3	tracing/kprobes: Support "string" type Support string type tracing and printing in kprobe-tracer. This allows user to trace string data in kernel including __user data. Note that sometimes __user data may not be accessed if it is paged-out (sorry, but kprobes operation should be done in atomic, we can not wait for page-in). Commiter note: Fixed up conflicts with `b7e2ece`. Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-07-05 15:54:45 -03:00
Ingo Molnar	08f8ba0799	Merge commit 'v2.6.35-rc4' into perf/core Merge reason: Pick up the latest perf fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-05 08:30:58 +02:00
Yehuda Sadeh	ff49d74ad3	module: initialize module dynamic debug later We should initialize the module dynamic debug datastructures only after determining that the module is not loaded yet. This fixes a bug that introduced in 2.6.35-rc2, where when a trying to load a module twice, we also load it's dynamic printing data twice which causes all sorts of nasty issues. Also handle the dynamic debug cleanup later on failure. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed a #ifdef) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-07-04 20:17:22 -07:00
Linus Torvalds	123f94f22e	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Cure nr_iowait_cpu() users init: Fix comment init, sched: Fix race between init and kthreadd	2010-07-02 09:52:58 -07:00
Tejun Heo	c7fc77f78f	workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead WQ_SINGLE_CPU combined with @max_active of 1 is used to achieve full ordering among works queued to a workqueue. The same can be achieved using WQ_UNBOUND as unbound workqueues always use the gcwq for WORK_CPU_UNBOUND. As @max_active is always one and benefits from cpu locality isn't accessible anyway, serving them with unbound workqueues should be fine. Drop WQ_SINGLE_CPU support and use WQ_UNBOUND instead. Note that most single thread workqueue users will be converted to use multithread or non-reentrant instead and only the ones which require strict ordering will keep using WQ_UNBOUND + @max_active of 1. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 11:00:08 +02:00
Tejun Heo	f34217977d	workqueue: implement unbound workqueue This patch implements unbound workqueue which can be specified with WQ_UNBOUND flag on creation. An unbound workqueue has the following properties. * It uses a dedicated gcwq with a pseudo CPU number WORK_CPU_UNBOUND. This gcwq is always online and disassociated. * Workers are not bound to any CPU and not concurrency managed. Works are dispatched to workers as soon as possible and the only applied limitation is @max_active. IOW, all unbound workqeueues are implicitly high priority. Unbound workqueues can be used as simple execution context provider. Contexts unbound to any cpu are served as soon as possible. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Howells <dhowells@redhat.com>	2010-07-02 11:00:02 +02:00
Tejun Heo	bdbc5dd7de	workqueue: prepare for WQ_UNBOUND implementation In preparation of WQ_UNBOUND addition, make the following changes. * Add WORK_CPU_* constants for pseudo cpu id numbers used (currently only WORK_CPU_NONE) and use them instead of NR_CPUS. This is to allow another pseudo cpu id for unbound cpu. * Reorder WQ_* flags. * Make workqueue_struct->cpu_wq a union which contains a percpu pointer, regular pointer and an unsigned long value and use kzalloc/kfree() in UP allocation path. This will be used to implement unbound workqueues which will use only one cwq on SMPs. * Move alloc_cwqs() allocation after initialization of wq fields, so that alloc_cwqs() has access to wq->flags. * Trivial relocation of wq local variables in freeze functions. These changes don't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:59:57 +02:00
Tejun Heo	d313dd85ad	workqueue: fix worker management invocation without pending works When there's no pending work to do, worker_thread() goes back to sleep after waking up without checking whether worker management is necessary. This means that idle worker exit requests can be ignored if the gcwq stays empty. Fix it by making worker_thread() always check whether worker management is necessary before going to sleep. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	a1e453d279	workqueue: fix incorrect cpu number BUG_ON() in get_work_gcwq() get_work_gcwq() was incorrectly triggering BUG_ON() if cpu number is equal to or higher than num_possible_cpus() instead of nr_cpu_ids. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	4ce48b37bf	workqueue: fix race condition in flush_workqueue() When one flusher is cascading to the next flusher, it first sets wq->first_flusher to the next one and sets up the next flush cycle. If there's nothing to do for the next cycle, it clears wq->flush_flusher and proceeds to the one after that. If the woken up flusher checks wq->first_flusher before it gets cleared, it will incorrectly assume the role of the first flusher, which triggers BUG_ON() sanity check. Fix it by checking wq->first_flusher again after grabbing the mutex. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	cb44476699	workqueue: use worker_set/clr_flags() only from worker itself worker_set/clr_flags() assume that if none of NOT_RUNNING flags is set the worker must be contributing to nr_running which is only true if the worker is actually running. As when called from self, it is guaranteed that the worker is running, those functions can be safely used from the worker itself and they aren't necessary from other places anyway. Make the following changes to fix the bug. * Make worker_set/clr_flags() whine if not called from self. * Convert all places which called those functions from other tasks to manipulate flags directly. * Make trustee_thread() directly clear nr_running after setting WORKER_ROGUE on all workers. This is the only place where nr_running manipulation is necessary outside of workers themselves. * While at it, add sanity check for nr_running in worker_enter_idle(). Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:50 +02:00
Peter Zijlstra	8c215bd389	sched: Cure nr_iowait_cpu() users Commit `0224cf4c5e` (sched: Intoduce get_cpu_iowait_time_us()) broke things by not making sure preemption was indeed disabled by the callers of nr_iowait_cpu() which took the iowait value of the current cpu. This resulted in a heap of preempt warnings. Cure this by making nr_iowait_cpu() take a cpu number and fix up the callers to pass in the right number. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: linux-pm@lists.linux-foundation.org LKML-Reference: <1277968037.1868.120.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-01 09:39:48 +02:00
Ingo Molnar	0a54cec0c2	Merge branch 'linus' into core/rcu Conflicts: fs/fs-writeback.c Merge reason: Resolve the conflict Note, i picked the version from Linus's tree, which effectively reverts the fs-writeback.c bits of: `b97181f`: fs: remove all rcu head initializations, except on_stack initializations As the upstream changes to this file changed this code heavily and the first attempt to resolve the conflict resulted in a non-booting kernel. It's safer to re-try this portion of the commit cleanly. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-01 09:31:25 +02:00
Michal Hocko	7a0ea09ad5	futex: futex_find_get_task remove credentails check futex_find_get_task is currently used (through lookup_pi_state) from two contexts, futex_requeue and futex_lock_pi_atomic. None of the paths looks it needs the credentials check, though. Different (e)uids shouldn't matter at all because the only thing that is important for shared futex is the accessibility of the shared memory. The credentail check results in glibc assert failure or process hang (if glibc is compiled without assert support) for shared robust pthread mutex with priority inheritance if a process tries to lock already held lock owned by a process with a different euid: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 \|\| !robust' failed. The problem is that futex_lock_pi_atomic which is called when we try to lock already held lock checks the current holder (tid is stored in the futex value) to get the PI state. It uses lookup_pi_state which in turn gets task struct from futex_find_get_task. ESRCH is returned either when the task is not found or if credentials check fails. futex_lock_pi_atomic simply returns if it gets ESRCH. glibc code, however, doesn't expect that robust lock returns with ESRCH because it should get either success or owner died. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-30 15:43:44 -07:00
Pavan Naregundi	e05bd3367b	kexec: fix Oops in crash_shrink_memory() When crashkernel is not enabled, "echo 0 > /sys/kernel/kexec_crash_size" OOPSes the kernel in crash_shrink_memory. This happens when crash_shrink_memory tries to release the 'crashk_res' resource which are not reserved. Also value of "/sys/kernel/kexec_crash_size" shows as 1, which should be 0. This patch fixes the OOPS in crash_shrink_memory and shows "/sys/kernel/kexec_crash_size" as 0 when crash kernel memory is not reserved. Signed-off-by: Pavan Naregundi <pavan@linux.vnet.ibm.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-29 15:29:31 -07:00
Michael Neuling	2ec57d448b	sched: Fix spelling of sibling No logic changes, only spelling. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: linuxppc-dev@ozlabs.org Cc: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <15249.1277776921@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-29 10:44:29 +02:00
Tejun Heo	fb0e7beb5c	workqueue: implement cpu intensive workqueue This patch implements cpu intensive workqueue which can be specified with WQ_CPU_INTENSIVE flag on creation. Works queued to a cpu intensive workqueue don't participate in concurrency management. IOW, it doesn't contribute to gcwq->nr_running and thus doesn't delay excution of other works. Note that although cpu intensive works won't delay other works, they can be delayed by other works. Combine with WQ_HIGHPRI to avoid being delayed by other works too. As the name suggests this is useful when using workqueue for cpu intensive works. Workers executing cpu intensive works are not considered for workqueue concurrency management and left for the scheduler to manage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:15 +02:00
Tejun Heo	649027d73a	workqueue: implement high priority workqueue This patch implements high priority workqueue which can be specified with WQ_HIGHPRI flag on creation. A high priority workqueue has the following properties. * A work queued to it is queued at the head of the worklist of the respective gcwq after other highpri works, while normal works are always appended at the end. * As long as there are highpri works on gcwq->worklist, [__]need_more_worker() remains %true and process_one_work() wakes up another worker before it start executing a work. The above two properties guarantee that works queued to high priority workqueues are dispatched to workers and start execution as soon as possible regardless of the state of other works. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	dcd989cb73	workqueue: implement several utility APIs Implement the following utility APIs. workqueue_set_max_active() : adjust max_active of a wq workqueue_congested() : test whether a wq is contested work_cpu() : determine the last / current cpu of a work work_busy() : query whether a work is busy * Anton Blanchard fixed missing ret initialization in work_busy(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	d320c03830	workqueue: s/__create_workqueue()/alloc_workqueue()/, and add system workqueues This patch makes changes to make new workqueue features available to its users. * Now that workqueue is more featureful, there should be a public workqueue creation function which takes paramters to control them. Rename __create_workqueue() to alloc_workqueue() and make 0 max_active mean WQ_DFL_ACTIVE. In the long run, all create_workqueue_() will be converted over to alloc_workqueue(). To further unify access interface, rename keventd_wq to system_wq and export it. * Add system_long_wq and system_nrt_wq. The former is to host long running works separately (so that flush_scheduled_work() dosen't take so long) and the latter guarantees any queued work item is never executed in parallel by multiple CPUs. These will be used by future patches to update workqueue users. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	b71ab8c202	workqueue: increase max_active of keventd and kill current_is_keventd() Define WQ_MAX_ACTIVE and create keventd with max_active set to half of it which means that keventd now can process upto WQ_MAX_ACTIVE / 2 - 1 works concurrently. Unless some combination can result in dependency loop longer than max_active, deadlock won't happen and thus it's unnecessary to check whether current_is_keventd() before trying to schedule a work. Kill current_is_keventd(). (Lockdep annotations are broken. We need lock_map_acquire_read_norecurse()) Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com>	2010-06-29 10:07:14 +02:00
Tejun Heo	e22bee782b	workqueue: implement concurrency managed dynamic worker pool Instead of creating a worker for each cwq and putting it into the shared pool, manage per-cpu workers dynamically. Works aren't supposed to be cpu cycle hogs and maintaining just enough concurrency to prevent work processing from stalling due to lack of processing context is optimal. gcwq keeps the number of concurrent active workers to minimum but no less. As long as there's one or more running workers on the cpu, no new worker is scheduled so that works can be processed in batch as much as possible but when the last running worker blocks, gcwq immediately schedules new worker so that the cpu doesn't sit idle while there are works to be processed. gcwq always keeps at least single idle worker around. When a new worker is necessary and the worker is the last idle one, the worker assumes the role of "manager" and manages the worker pool - ie. creates another worker. Forward-progress is guaranteed by having dedicated rescue workers for workqueues which may be necessary while creating a new worker. When the manager is having problem creating a new worker, mayday timer activates and rescue workers are summoned to the cpu and execute works which might be necessary to create new workers. Trustee is expanded to serve the role of manager while a CPU is being taken down and stays down. As no new works are supposed to be queued on a dead cpu, it just needs to drain all the existing ones. Trustee continues to try to create new workers and summon rescuers as long as there are pending works. If the CPU is brought back up while the trustee is still trying to drain the gcwq from the previous offlining, the trustee will kill all idles ones and tell workers which are still busy to rebind to the cpu, and pass control over to gcwq which assumes the manager role as necessary. Concurrency managed worker pool reduces the number of workers drastically. Only workers which are necessary to keep the processing going are created and kept. Also, it reduces cache footprint by avoiding unnecessarily switching contexts between different workers. Please note that this patch does not increase max_active of any workqueue. All workqueues can still only process one work per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	d302f01782	workqueue: implement worker_{set\|clr}_flags() Implement worker_{set\|clr}_flags() to manipulate worker flags. These are currently simple wrappers but logics to track the current worker state and the current level of concurrency will be added. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	7e11629d0e	workqueue: use shared worklist and pool all workers per cpu Use gcwq->worklist instead of cwq->worklist and break the strict association between a cwq and its worker. All works queued on a cpu are queued on gcwq->worklist and processed by any available worker on the gcwq. As there no longer is strict association between a cwq and its worker, whether a work is executing can now only be determined by calling [__]find_worker_executing_work(). After this change, the only association between a cwq and its worker is that a cwq puts a worker into shared worker pool on creation and kills it on destruction. As all workqueues are still limited to max_active of one, this means that there are always at least as many workers as active works and thus there's no danger for deadlock. The break of strong association between cwqs and workers requires somewhat clumsy changes to current_is_keventd() and destroy_workqueue(). Dynamic worker pool management will remove both clumsy changes. current_is_keventd() won't be necessary at all as the only reason it exists is to avoid queueing a work from a work which will be allowed just fine. The clumsy part of destroy_workqueue() is added because a worker can only be destroyed while idle and there's no guarantee a worker is idle when its wq is going down. With dynamic pool management, workers are not associated with workqueues at all and only idle ones will be submitted to destroy_workqueue() so the code won't be necessary anymore. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	18aa9effad	workqueue: implement WQ_NON_REENTRANT With gcwq managing all the workers and work->data pointing to the last gcwq it was on, non-reentrance can be easily implemented by checking whether the work is still running on the previous gcwq on queueing. Implement it. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	7a22ad757e	workqueue: carry cpu number in work data once execution starts To implement non-reentrant workqueue, the last gcwq a work was executed on must be reliably obtainable as long as the work structure is valid even if the previous workqueue has been destroyed. To achieve this, work->data will be overloaded to carry the last cpu number once execution starts so that the previous gcwq can be located reliably. This means that cwq can't be obtained from work after execution starts but only gcwq. Implement set_work_{cwq\|cpu}(), get_work_[g]cwq() and clear_work_data() to set work data to the cpu number when starting execution, access the overloaded work data and clear it after cancellation. queue_delayed_work_on() is updated to preserve the last cpu while in-flight in timer and other callers which depended on getting cwq from work after execution starts are converted to depend on gcwq instead. * Anton Blanchard fixed compile error on powerpc due to missing linux/threads.h include. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	8cca0eea39	workqueue: add find_worker_executing_work() and track current_cwq Now that all the workers are tracked by gcwq, we can find which worker is executing a work from gcwq. Implement find_worker_executing_work() and make worker track its current_cwq so that we can find things the other way around. This will be used to implement non-reentrant wqs. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	502ca9d819	workqueue: make single thread workqueue shared worker pool friendly Reimplement st (single thread) workqueue so that it's friendly to shared worker pool. It was originally implemented by confining st workqueues to use cwq of a fixed cpu and always having a worker for the cpu. This implementation isn't very friendly to shared worker pool and suboptimal in that it ends up crossing cpu boundaries often. Reimplement st workqueue using dynamic single cpu binding and cwq->limit. WQ_SINGLE_THREAD is replaced with WQ_SINGLE_CPU. In a single cpu workqueue, at most single cwq is bound to the wq at any given time. Arbitration is done using atomic accesses to wq->single_cpu when queueing a work. Once bound, the binding stays till the workqueue is drained. Note that the binding is never broken while a workqueue is frozen. This is because idle cwqs may have works waiting in delayed_works queue while frozen. On thaw, the cwq is restarted if there are any delayed works or unbound otherwise. When combined with max_active limit of 1, single cpu workqueue has exactly the same execution properties as the original single thread workqueue while allowing sharing of per-cpu workers. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	db7bccf45c	workqueue: reimplement CPU hotplugging support using trustee Reimplement CPU hotplugging support using trustee thread. On CPU down, a trustee thread is created and each step of CPU down is executed by the trustee and workqueue_cpu_callback() simply drives and waits for trustee state transitions. CPU down operation no longer waits for works to be drained but trustee sticks around till all pending works have been completed. If CPU is brought back up while works are still draining, workqueue_cpu_callback() tells trustee to step down and tell workers to rebind to the cpu. As it's difficult to tell whether cwqs are empty if it's freezing or frozen, trustee doesn't consider draining to be complete while a gcwq is freezing or frozen (tracked by new GCWQ_FREEZING flag). Also, workers which get unbound from their cpu are marked with WORKER_ROGUE. Trustee based implementation doesn't bring any new feature at this point but it will be used to manage worker pool when dynamic shared worker pool is implemented. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	c8e55f3602	workqueue: implement worker states Implement worker states. After created, a worker is STARTED. While a worker isn't processing a work, it's IDLE and chained on gcwq->idle_list. While processing a work, a worker is BUSY and chained on gcwq->busy_hash. Also, gcwq now counts the number of all workers and idle ones. worker_thread() is restructured to reflect state transitions. cwq->more_work is removed and waking up a worker makes it check for events. A worker is killed by setting DIE flag while it's IDLE and waking it up. This gives gcwq better visibility of what's going on and allows it to find out whether a work is executing quickly which is necessary to have multiple workers processing the same cwq. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	8b03ae3cde	workqueue: introduce global cwq and unify cwq locks There is one gcwq (global cwq) per each cpu and all cwqs on an cpu point to it. A gcwq contains a lock to be used by all cwqs on the cpu and an ida to give IDs to workers belonging to the cpu. This patch introduces gcwq, moves worker_ida into gcwq and make all cwqs on the same cpu use the cpu's gcwq->lock instead of separate locks. gcwq->ida is now protected by gcwq->lock too. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	a0a1a5fd4f	workqueue: reimplement workqueue freeze using max_active Currently, workqueue freezing is implemented by marking the worker freezeable and calling try_to_freeze() from dispatch loop. Reimplement it using cwq->limit so that the workqueue is frozen instead of the worker. * workqueue_struct->saved_max_active is added which stores the specified max_active on initialization. * On freeze, all cwq->max_active's are quenched to zero. Freezing is complete when nr_active on all cwqs reach zero. * On thaw, all cwq->max_active's are restored to wq->saved_max_active and the worklist is repopulated. This new implementation allows having single shared pool of workers per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	1e19ffc63d	workqueue: implement per-cwq active work limit Add cwq->nr_active, cwq->max_active and cwq->delayed_work. nr_active counts the number of active works per cwq. A work is active if it's flushable (colored) and is on cwq's worklist. If nr_active reaches max_active, new works are queued on cwq->delayed_work and activated later as works on the cwq complete and decrement nr_active. cwq->max_active can be specified via the new @max_active parameter to __create_workqueue() and is set to 1 for all workqueues for now. As each cwq has only single worker now, this double queueing doesn't cause any behavior difference visible to its users. This will be used to reimplement freeze/thaw and implement shared worker pool. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	affee4b294	workqueue: reimplement work flushing using linked works A work is linked to the next one by having WORK_STRUCT_LINKED bit set and these links can be chained. When a linked work is dispatched to a worker, all linked works are dispatched to the worker's newly added ->scheduled queue and processed back-to-back. Currently, as there's only single worker per cwq, having linked works doesn't make any visible behavior difference. This change is to prepare for multiple shared workers per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	c34056a3fd	workqueue: introduce worker Separate out worker thread related information to struct worker from struct cpu_workqueue_struct and implement helper functions to deal with the new struct worker. The only change which is visible outside is that now workqueue worker are all named "kworker/CPUID:WORKERID" where WORKERID is allocated from per-cpu ida. This is in preparation of concurrency managed workqueue where shared multiple workers would be available per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00
Tejun Heo	73f53c4aa7	workqueue: reimplement workqueue flushing using color coded works Reimplement workqueue flushing using color coded works. wq has the current work color which is painted on the works being issued via cwqs. Flushing a workqueue is achieved by advancing the current work colors of cwqs and waiting for all the works which have any of the previous colors to drain. Currently there are 16 possible colors, one is reserved for no color and 15 colors are useable allowing 14 concurrent flushes. When color space gets full, flush attempts are batched up and processed together when color frees up, so even with many concurrent flushers, the new implementation won't build up huge queue of flushers which has to be processed one after another. Only works which are queued via __queue_work() are colored. Works which are directly put on queue using insert_work() use NO_COLOR and don't participate in workqueue flushing. Currently only works used for work-specific flush fall in this category. This new implementation leaves only cleanup_workqueue_thread() as the user of flush_cpu_workqueue(). Just make its users use flush_workqueue() and kthread_stop() directly and kill cleanup_workqueue_thread(). As workqueue flushing doesn't use barrier request anymore, the comment describing the complex synchronization around it in cleanup_workqueue_thread() is removed together with the function. This new implementation is to allow having and sharing multiple workers per cpu. Please note that one more bit is reserved for a future work flag by this patch. This is to avoid shifting bits and updating comments later. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00
Tejun Heo	0f900049cb	workqueue: update cwq alignement work->data field is used for two purposes. It points to cwq it's queued on and the lower bits are used for flags. Currently, two bits are reserved which is always safe as 4 byte alignment is guaranteed on every architecture. However, future changes will need more flag bits. On SMP, the percpu allocator is capable of honoring larger alignment (there are other users which depend on it) and larger alignment works just fine. On UP, percpu allocator is a thin wrapper around kzalloc/kfree() and don't honor alignment request. This patch introduces WORK_STRUCT_FLAG_BITS and implements alloc/free_cwqs() which guarantees max(1 << WORK_STRUCT_FLAG_BITS, __alignof__(unsigned long long) alignment both on SMP and UP. On SMP, simply wrapping percpu allocator is enough. On UP, extra space is allocated so that cwq can be aligned and the original pointer can be stored after it which is used in the free path. * Alignment problem on UP is reported by Michal Simek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Reported-by: Michal Simek <michal.simek@petalogix.com>	2010-06-29 10:07:11 +02:00
Tejun Heo	1537663f57	workqueue: kill cpu_populated_map Worker management is about to be overhauled. Simplify things by removing cpu_populated_map, creating workers for all possible cpus and making single threaded workqueues behave more like multi threaded ones. After this patch, all cwqs are always initialized, all workqueues are linked on the workqueues list and workers for all possibles cpus always exist. This also makes CPU hotplug support simpler - checking ->cpus_allowed before processing works in worker_thread() and flushing cwqs on CPU_POST_DEAD are enough. While at it, make get_cwq() always return the cwq for the specified cpu, add target_cwq() for cases where single thread distinction is necessary and drop all direct usage of per_cpu_ptr() on wq->cpu_wq. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00
Tejun Heo	6416669975	workqueue: temporarily remove workqueue tracing Strip tracing code from workqueue and remove workqueue tracing. This is temporary measure till concurrency managed workqueue is complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>	2010-06-29 10:07:11 +02:00
Tejun Heo	a62428c0ae	workqueue: separate out process_one_work() Separate out process_one_work() out of run_workqueue(). This patch doesn't cause any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	22df02bb3f	workqueue: define masks for work flags and conditionalize STATIC flags Work flags are about to see more traditional mask handling. Define WORK_STRUCT__BIT as the bit position constant and redefine WORK_STRUCT_ as bit masks. Also, make WORK_STRUCT_STATIC_* flags conditional While at it, re-define these constants as enums and use WORK_STRUCT_STATIC instead of hard-coding 2 in WORK_DATA_STATIC_INIT(). Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	97e37d7b9e	workqueue: merge feature parameters into flags Currently, __create_workqueue_key() takes @singlethread and @freezeable paramters and store them separately in workqueue_struct. Merge them into a single flags parameter and field and use WQ_FREEZEABLE and WQ_SINGLE_THREAD. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	4690c4ab56	workqueue: misc/cosmetic updates Make the following updates in preparation of concurrency managed workqueue. None of these changes causes any visible behavior difference. * Add comments and adjust indentations to data structures and several functions. * Rename wq_per_cpu() to get_cwq() and swap the position of two parameters for consistency. Convert a direct per_cpu_ptr() access to wq->cpu_wq to get_cwq(). * Add work_static() and Update set_wq_data() such that it sets the flags part to WORK_STRUCT_PENDING \| WORK_STRUCT_STATIC if static \| @extra_flags. * Move santiy check on work->entry emptiness from queue_work_on() to __queue_work() which all queueing paths share. * Make __queue_work() take @cpu and @wq instead of @cwq. * Restructure flush_work() and __create_workqueue_key() to make them easier to modify. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	c790bce048	workqueue: kill RT workqueue With stop_machine() converted to use cpu_stop, RT workqueue doesn't have any user left. Kill RT workqueue support. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:09 +02:00
Tejun Heo	82805ab77d	kthread: implement kthread_data() Implement kthread_data() which takes @task pointing to a kthread and returns @data specified when creating the kthread. The caller is responsible for ensuring the validity of @task when calling this function. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:09 +02:00
Tejun Heo	b56c0d8937	kthread: implement kthread_worker Implement simple work processor for kthread. This is to ease using kthread. Single thread workqueue used to be used for things like this but workqueue won't guarantee fixed kthread association anymore to enable worker sharing. This can be used in cases where specific kthread association is necessary, for example, when it should have RT priority or be assigned to certain cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:09 +02:00
Steven Rostedt	a1d0ce8213	tracing: Use class->reg() for all registering of events Because kprobes and syscalls need special processing to register events, the class->reg() method was created to handle the differences. But instead of creating a default ->reg for perf and ftrace events, the code was scattered with: if (class->reg) class->reg(); else default_reg(); This is messy and can also lead to bugs. This patch cleans up this code and creates a default reg() entry for the events allowing for the code to directly call the class->reg() without the condition. Reported-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 21:13:14 -04:00
Chase Douglas	d62f85d1e2	tracing/function-graph: Use correct string size for snprintf The nsecs_str string is a local variable defined as: char nsecs_str[5]; It is possible for the snprintf call to use a size value larger than the size of the string. This should not cause a buffer overrun as it is written now due to the value for the string format "%03lu" can not be larger than 1000. However, this change makes it correct. By making the size correct we guard against potential future changes that could actually cause a buffer overrun. Signed-off-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1276619355-18116-1-git-send-email-chase.douglas@canonical.com> [ added 'UL' to number 8 to fix gcc warning comparing it to sizeof() ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 21:11:39 -04:00
Li Zefan	67ead0a6ce	tracing: Remove open-coded __trace_add_event_call() Let trace_module_add_events() and event_trace_init() call __trace_add_event_call(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA37E9.1020106@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:55 -04:00
Li Zefan	ffb9f99528	tracing: Remove redundant raw_init callbacks raw_init callback is optional. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA37D4.7070500@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:53 -04:00
Li Zefan	c9d932cf8a	tracing: Remove test of NULL define_fields callback Every event (or event class) has it's define_fields callback, so the test is redundant. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA37BC.8080707@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:52 -04:00
Li Zefan	8728fe501e	tracing: Don't allocate common fields for every trace events Every event has the same common fields, so it's a big waste of memory to have a copy of those fields for every event. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA3759.30105@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:46 -04:00
Li Zefan	c9642c49aa	tracing: Use a global field list for all syscall exit events All syscall exit events have the same fields. The kernel size drops 2.5K: text data bss dec hex filename 7018612 2034376 7251132 16304120 f8c7f8 vmlinux.o.orig 7018612 2031888 7251132 16301632 f8be40 vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA3746.8070100@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:44 -04:00
Thomas Gleixner	f384c954c9	Merge branch 'linus' into perf/core Reason: Further changes conflict with upstream fixes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-06-28 22:33:24 +02:00
Linus Torvalds	5904b3b81d	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix undeclared ENOSYS in include/linux/tracepoint.h perf record: prevent kill(0, SIGTERM); perf session: Remove threads from tree on PERF_RECORD_EXIT perf/tracing: Fix regression of perf losing kprobe events perf_events: Fix Intel Westmere event constraints perf record: Don't call newt functions when not initialized	2010-06-28 12:24:43 -07:00
Linus Torvalds	f3866db8f7	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Deal with desc->set_type() changing desc->chip	2010-06-28 12:23:12 -07:00

... 5 6 7 8 9 ...

10552 Commits