perf_event_init_task() should clear child->perf_event_ctxp[]
before anything else. Otherwise, if
perf_event_init_context(perf_hw_context) fails,
perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
copied from parent->perf_event_ctxp[] by dup_task_struct().
Also move the initialization of perf_event_mutex and
perf_event_list from perf_event_init_context() to
perf_event_init_context().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
LKML-Reference: <20110119182228.GC12183@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
find_get_context() must not install the new perf_event_context
if the task has already passed perf_event_exit_task().
If nothing else, this means the memory leak. Initially
ctx->refcount == 2, it is supposed that
perf_event_exit_task_context() should participate and do the
necessary put_ctx().
find_lively_task_by_vpid() checks PF_EXITING but this buys
nothing, by the time we call find_get_context() this task can be
already dead. To the point, cmpxchg() can succeed when the task
has already done the last schedule().
Change find_get_context() to populate task->perf_event_ctxp[]
under task->perf_event_mutex, this way we can trust PF_EXITING
because perf_event_exit_task() takes the same mutex.
Also, change perf_event_exit_task_context() to use
rcu_dereference(). Probably this is not strictly needed, but
with or without this change find_get_context() can race with
setup_new_exec()->perf_event_exit_task(), rcu_dereference()
looks better.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
LKML-Reference: <20110119182207.GB12183@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now if we enable blktrace, cfq has too many messages output to the
trace buffer. It is fine if we don't specify any action mask.
But if I do like this:
blktrace /dev/sdb -a issue -a complete -o - | blkparse -i -
I only want to see 'D' and 'C', while with the following command
dd if=/mnt/ocfs2/test of=/dev/null bs=4k count=1 iflag=direct
I will get(with a 2.6.37 vanilla kernel):
8,16 0 0 0.000000000 0 m N cfq3805 alloced
8,16 0 0 0.000004126 0 m N cfq3805 insert_request
8,16 0 0 0.000004884 0 m N cfq3805 add_to_rr
8,16 0 0 0.000008417 0 m N cfq workload slice:300
8,16 0 0 0.000009557 0 m N cfq3805 set_active wl_prio:0 wl_type:2
8,16 0 0 0.000010640 0 m N cfq3805 fifo= (null)
8,16 0 0 0.000011193 0 m N cfq3805 dispatch_insert
8,16 0 0 0.000012221 0 m N cfq3805 dispatched a request
8,16 0 0 0.000012802 0 m N cfq3805 activate rq, drv=1
8,16 0 1 0.000013181 3805 D R 114759 + 8 [dd]
8,16 0 2 0.000164244 0 C R 114759 + 8 [0]
8,16 0 0 0.000167997 0 m N cfq3805 complete rqnoidle 0
8,16 0 0 0.000168782 0 m N cfq3805 set_slice=100
8,16 0 0 0.000169874 0 m N cfq3805 arm_idle: 8 group_idle: 0
8,16 0 0 0.000170189 0 m N cfq schedule dispatch
8,16 0 0 0.000397938 0 m N cfq3805 slice expired t=0
8,16 0 0 0.000399763 0 m N cfq3805 sl_used=1 disp=1 charge=1 iops=0 sect=8
8,16 0 0 0.000400227 0 m N cfq3805 del_from_rr
8,16 0 0 0.000400882 0 m N cfq3805 put_queue
See, there are 19 lines while I only need 2. I don't think it is
appropriate for a user.
So this patch will disable any messages if the BLK_TC_NOTIFY isn't set.
Now the output for the same command will look like:
8,16 0 1 0.000000000 4908 D R 114759 + 8 [dd]
8,16 0 2 0.000146827 0 C R 114759 + 8 [0]
Yes, it is what I want to see.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Fix a bunch of
warning: ‘inline’ is not at beginning of declaration
messages when building a 'make allyesconfig' kernel with -Wextra.
These warnings are trivial to kill, yet rather annoying when building with
-Wextra.
The more we can cut down on pointless crap like this the better (IMHO).
A previous patch to do this for a 'allnoconfig' build has already been
merged. This just takes the cleanup a little further.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
By not notifying the controller of the on-exit move back to
init_css_set, we fail to move the task out of the previous
cgroup's cfs_rq. This leads to an opportunity for a
cgroup-destroy to come in and free the cgroup (there are no
active tasks left in it after all) to which the not-quite dead
task is still enqueued.
Reported-by: Miklos Vajna <vmiklos@frugalware.org>
Fixed-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1293206353.29444.205.camel@laptop>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Validate cpu early in perf_event_alloc()
perf: Find_get_context: fix the per-cpu-counter check
perf: Fix contexted inheritance
Starting from perf_event_alloc()->perf_init_event(), the kernel
assumes that event->cpu is either -1 or the valid CPU number.
Change perf_event_alloc() to validate this argument early. This
also means we can remove the similar check in
find_get_context().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: gregkh@suse.de
Cc: stable@kernel.org
LKML-Reference: <20110118161032.GC693@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If task == NULL, find_get_context() should always check that cpu
is correct.
Afaics, the bug was introduced by 38a81da2 "perf events: Clean
up pid passing", but even before that commit "&& cpu != -1" was
not exactly right, -ESRCH from find_task_by_vpid() is not
accurate.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: gregkh@suse.de
Cc: stable@kernel.org
LKML-Reference: <20110118161008.GB693@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus reported that the RCU lockdep annotation bits triggered for this
rcu_dereference() because we're not holding rcu_read_lock().
Going over the code I cannot convince myself its correct:
- holding a ref on the parent_ctx, doesn't avoid it being uncloned
concurrently (as the comment says), so we can race with a free.
- holding parent_ctx->mutex doesn't avoid the above free from taking
place either, it would at best avoid parent_ctx from being freed.
I.e. the warning is correct. To fix the bug, serialize against the
unclone_ctx() call by extending the reach of the parent_ctx->lock.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed unsigned comparison may lead to superfluous resched if leftmost
is right of the current task, wasting a few cycles, and inadvertently
_lengthening_ the current task's slice.
Reported-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294202477.9384.5.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now rq->rq_sched_info.bkl_count is not used for rq, scroll
rq->bkl_count into it. Thus we can save some space for rq.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294991859-13246-1-git-send-email-yong.zhang0@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If CONFIG_RT_GROUP_SCHED is set, __sched_setscheduler() fails due to autogroup
not allocating rt_runtime. Free unused/unusable rt_se and rt_rq, redirect RT
tasks to the root task group, and tell __sched_setscheduler() that it's ok.
Reported-and-tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294890890.8089.39.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add autogroup name to cfs_rq and tasks information to /proc/sched_debug.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110111101257.GF4772@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Displaying of group names in /proc/sched_debug was dropped in autogroup
patches. Add group names while displaying cfs_rq and tasks information.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110111101153.GE4772@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previously effective_load would approximate the global load weight present on
a group taking advantage of:
entity_weight = tg->shares ( lw / global_lw ), where entity_weight was provided
by tg_shares_up.
This worked (approximately) for an 'empty' (at tg level) cpu since we would
place boost load representative of what a newly woken task would receive.
However, now that load is instantaneously updated this assumption is no longer
true and the load calculation is rather incorrect in this case.
Fix this (and improve the general case) by re-writing effective_load to take
advantage of the new shares distribution code.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110115015817.069769529@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When xt_AUDIT is built as a module, modpost reports a problem.
MODPOST 322 modules
ERROR: "audit_enabled" [net/netfilter/x_tables.ko] undefined!
WARNING: modpost: Found 1 section mismatch(es).
Cc: Thomas Graf <tgraf@redhat.com>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: avoid pointless blocked-task warnings
rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status
rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi()
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, olpc: Add missing Kconfig dependencies
x86, mrst: Set correct APB timer IRQ affinity for secondary cpu
x86: tsc: Fix calibration refinement conditionals to avoid divide by zero
x86, ia64, acpi: Clean up x86-ism in drivers/acpi/numa.c
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timekeeping: Make local variables static
time: Rename misnamed minsec argument of clocks_calc_mult_shift()
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Remove syscall_exit_fields
tracing: Only process module tracepoints once
perf record: Add "nodelay" mode, disabled by default
perf sched: Fix list of events, dropping unsupported ':r' modifier
Revert "perf tools: Emit clearer message for sys_perf_event_open ENOENT return"
perf top: Fix annotate segv
perf evsel: Fix order of event list deletion
There is no need for syscall_exit_fields as the syscall
exit event class can already host the fields in its structure,
like most other trace events do by default. Use that
default behavior instead.
Following this scheme, we don't need anymore to override the
get_fields() callback of the syscall exit event class either.
Hence both syscall_exit_fields and syscall_get_exit_fields() can
be removed.
Also changed some indentation to keep the following under 80
characters:
".fields = LIST_HEAD_INIT(event_class_syscall_exit.fields),"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4D301C0E.8090408@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
kernel: fix hlist_bl again
cgroups: Fix a lockdep warning at cgroup removal
fs: namei fix ->put_link on wrong inode in do_filp_open
cgroup can't use simple_lookup(), since that'd override its desired ->d_op.
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the RCU callback-processing kthread has nothing to do, it parks in
a wait_event(). If RCU remains idle for more than two minutes, the
kernel complains about this. This commit changes from wait_event()
to wait_event_interruptible() to prevent the kernel from complaining
just because RCU is idle.
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Thomas Weber <weber@corscience.de>
Tested-by: Russell King <rmk+kernel@arm.linux.org.uk>
Because the adaptive synchronize_srcu_expedited() approach has
worked very well in testing, remove the kernel parameter and
replace it by a C-preprocessor macro. If someone finds problems
with this approach, a more complex and aggressively adaptive
approach might be required.
Longer term, SRCU will be merged with the other RCU implementations,
at which point synchronize_srcu_expedited() will be event driven,
just as synchronize_sched_expedited() currently is. At that point,
there will be no need for this adaptive approach.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry
lock, which caused a lockdep warning.
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
ACPI: fix resource check message
ACPI / Battery: Update information on info notification and resume
ACPI: Drop device flag wake_capable
ACPI: Always check if _PRW is present before trying to evaluate it
ACPI / PM: Check status of power resources under mutexes
ACPI / PM: Rename acpi_power_off_device()
ACPI / PM: Drop acpi_power_nocheck
ACPI / PM: Drop acpi_bus_get_power()
Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
ACPI / Fan: Rework the handling of power resources
ACPI / PM: Register power resource devices as soon as they are needed
ACPI / PM: Register acpi_power_driver early
ACPI / PM: Add function for updating device power state consistently
ACPI / PM: Add function for device power state initialization
ACPI / PM: Introduce __acpi_bus_get_power()
ACPI / PM: Introduce function for refcounting device power resources
ACPI / PM: Add functions for manipulating lists of power resources
ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
ACPICA: Update version to 20101209
...
Add khugepaged to relocate fragmented pages into hugepages if new
hugepages become available. (this is indipendent of the defrag logic that
will have to make new hugepages available)
The fundamental reason why khugepaged is unavoidable, is that some memory
can be fragmented and not everything can be relocated. So when a virtual
machine quits and releases gigabytes of hugepages, we want to use those
freely available hugepages to create huge-pmd in the other virtual
machines that may be running on fragmented memory, to maximize the CPU
efficiency at all times. The scan is slow, it takes nearly zero cpu time,
except when it copies data (in which case it means we definitely want to
pay for that cpu time) so it seems a good tradeoff.
In addition to the hugepages being released by other process releasing
memory, we have the strong suspicion that the performance impact of
potentially defragmenting hugepages during or before each page fault could
lead to more performance inconsistency than allocating small pages at
first and having them collapsed into large pages later... if they prove
themselfs to be long lived mappings (khugepaged scan is slow so short
lived mappings have low probability to run into khugepaged if compared to
long lived mappings).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This increase the size of the mm struct a bit but it is needed to
preallocate one pte for each hugepage so that split_huge_page will not
require a fail path. Guarantee of success is a fundamental property of
split_huge_page to avoid decrasing swapping reliability and to avoid
adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
VM code to learn rolling back in the middle of its pte mangling operations
(if something we need it to learn handling pmd_trans_huge natively rather
being capable of rollback). When split_huge_page runs a pte is needed to
succeed the split, to map the newly splitted regular pages with a regular
pte. This way all existing VM code remains backwards compatible by just
adding a split_huge_page* one liner. The memory waste of those
preallocated ptes is negligible and so it is worth it.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Futex code is smarter than most other gup_fast O_DIRECT code and knows
about the compound internals. However now doing a put_page(head_page)
will not release the pin on the tail page taken by gup-fast, leading to
all sort of refcounting bugchecks. Getting a stable head_page is a little
tricky.
page_head = page is there because if this is not a tail page it's also the
page_head. Only in case this is a tail page, compound_head is called,
otherwise it's guaranteed unnecessary. And if it's a tail page
compound_head has to run atomically inside irq disabled section
__get_user_pages_fast before returning. Otherwise ->first_page won't be a
stable pointer.
Disableing irq before __get_user_page_fast and releasing irq after running
compound_head is needed because if __get_user_page_fast returns == 1, it
means the huge pmd is established and cannot go away from under us.
pmdp_splitting_flush_notify in __split_huge_page_splitting will have to
wait for local_irq_enable before the IPI delivery can return. This means
__split_huge_page_refcount can't be running from under us, and in turn
when we run compound_head(page) we're not reading a dangling pointer from
tailpage->first_page. Then after we get to stable head page, we are
always safe to call compound_lock and after taking the compound lock on
head page we can finally re-check if the page returned by gup-fast is
still a tail page. in which case we're set and we didn't need to split
the hugepage in order to take a futex on it.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork. Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.
Alternative considered:
* a setuid binary
* a daemon with CAP_SYS_RESOURCE
Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex. The alternatives also
have much higher overhead.
This patch updated from original patch based on feedback from David
Rientjes.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This warning was added in commit bdff746a39 ("clone: prepare to recycle
CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know,
no-one is actually using CLONE_STOPPED.
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use modern per_cpu API to increment {soft|hard}irq counters, and use
per_cpu allocation for (struct irq_desc)->kstats_irq instead of an array.
This gives better SMP/NUMA locality and saves few instructions per irq.
With small nr_cpuids values (8 for example), kstats_irq was a small array
(less than L1_CACHE_BYTES), potentially source of false sharing.
In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly
kstat_irqs_all[NR_IRQS][NR_CPUS] array.
Note: we still populate kstats_irq for all possible irqs in
early_irq_init(). We probably could use on-demand allocations. (Code
included in alloc_descs()). Problem is not all IRQS are used with a prior
alloc_descs() call.
kstat_irqs_this_cpu() is not used anymore, remove it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
fs: add documentation on fallocate hole punching
Gfs2: fail if we try to use hole punch
Btrfs: fail if we try to use hole punch
Ext4: fail if we try to use hole punch
Ocfs2: handle hole punching via fallocate properly
XFS: handle hole punching via fallocate properly
fs: add hole punching to fallocate
vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
fix signedness mess in rw_verify_area() on 64bit architectures
fs: fix kernel-doc for dcache::prepend_path
fs: fix kernel-doc for dcache::d_validate
sanitize ecryptfs ->mount()
switch afs
move internal-only parts of ncpfs headers to fs/ncpfs
switch ncpfs
switch 9p
pass default dentry_operations to mount_pseudo()
switch hostfs
switch affs
switch configfs
...
MONOTONIC_RAW clock timestamps are ideally suited for frequency
calculation and also fit well into the original NTP hardpps design. Now
phase and frequency can be adjusted separately: the former based on
REALTIME clock and the latter based on MONOTONIC_RAW clock.
A new function getnstime_raw_and_real is added to timekeeping subsystem to
capture both timestamps at the same time and atomically.
Signed-off-by: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit adds hardpps() implementation based upon the original one from
the NTPv4 reference kernel code from David Mills. However, it is highly
optimized towards very fast syncronization and maximum stickness to PPS
signal. The typical error is less then a microsecond.
To make it sync faster I had to throw away exponential phase filter so
that the full phase offset is corrected immediately. Then I also had to
throw away median phase filter because it gives a bigger error itself if
used without exponential filter.
Maybe we will find an appropriate filtering scheme in the future but it's
not necessary if the signal quality is ok.
Signed-off-by: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4be2c95d ("taskstats: pad taskstats netlink response for aligment
issues on ia64") added a null field to align the taskstats structure but
the discussion centered around ia64. The issue exists on other platforms
with inefficient unaligned access and adding them piecemeal would be an
unmaintainable mess.
This patch uses Dave Miller's suggestion of using a combination of
CONFIG_64BIT && !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
whether alignment is needed.
Note that this will cause breakage on those platforms with applications
like iotop which had hard-coded offsets into the packet to access the
taskstats structure.
The message seen on systems without the alignment fixes looks like: kernel
unaligned access to 0xe000023879dca9bc, ip=0xa000000100133d10
The addresses may vary but resolve to locations inside __delayacct_add_tsk.
iotop makes what I'd call unreasonable assumptions about the contents of a
netlink genetlink packet containing generic attributes. They're typed and
have headers that specify value lengths, so the client can (should)
identify and skip the ones the client doesn't understand.
The kernel, as of version 2.6.36, presented a packet like so:
+--------------------------------+
| genlmsghdr - 4 bytes |
+--------------------------------+
| NLA header - 4 bytes | /* Aggregate header */
+-+------------------------------+
| | NLA header - 4 bytes | /* PID header */
| +------------------------------+
| | pid/tgid - 4 bytes |
| +------------------------------+
| | NLA header - 4 bytes | /* stats header */
| + -----------------------------+ <- oops. aligned on 4 byte boundary
| | struct taskstats - 328 bytes |
+-+------------------------------+
The iotop code expects that the kernel will behave as it did then,
assuming that the packet format is set in stone. The format is set in
stone, but the packet offsets are not. There's nothing in the packet
format that guarantees that the packet will be sent in exactly the same
way. The attribute contents are set (or versioned) and the aggregate
contents are set but they can be anywhere in the packet.
The issue here isn't that an unaligned structure gets passed to userspace,
it's that the NLA infrastructure has something of a weakness: The 4 byte
attribute header may force the payload to be unaligned. The taskstats
structure is created at an unaligned location and then 64-bit values are
operated on inside the kernel, so the unaligned access warnings gets
spewed everywhere.
It's possible to use the unaligned access API to operate on the structure
in the kernel but it seems like a wasted effort to work around userspace
code that isn't following the packet format. Any new additions would also
need the be worked around. It's a maintenance nightmare.
The conclusion of the earlier discussion seemed to be "ok fine, if we have
to break it, don't break it on arches that don't have the problem." Dave
pointed out that the unaligned access problem doesn't only exist on ia64,
but also on other 64-bit arches that don't have efficient unaligned access
and it should be fixed there as well. The committed version of the patch
and this addition keep with the conclusion of that discussion not to break
it unnecessarily, which the pid padding and the packet padding fixes did
do. x86_64 and powerpc don't suffer this problem so they shouldn't suffer
the solution. Other 64-bit architectures do and will, though.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reported-by: David S. Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Florian Mickler <florian@mickler.org>
Cc: Guillaume Chazarain <guichaz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently on 64-bit arch the user_namespace is 2096 and when being
kmalloc-ed it resides on a 4k slab wasting 2003 bytes.
If we allocate a separate cache for it and reduce the hash size from 128
to 64 chains the packaging becomes *much* better - the struct is 1072
bytes and the hole between is 98 bytes.
[akpm@linux-foundation.org: s/__initcall/module_init/]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ctl_unnumbered.txt have been removed in Documentation directory so just
also remove this invalid comments
[akpm@linux-foundation.org: fix Documentation/sysctl/00-INDEX, per Dave]
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Cc: Dave Young <hidave.darkstar@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use temporary lr for struct latency_record for improved readability and
fewer columns used. Removed trailing space from output.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_list_lock is used to protect dump_list in kmsg_dumper implementation,
kmsg_dump() uses it to traverse dump_list too. But if there is contention
on the lock, kmsg_dump() will fail, and the valuable kernel message may be
lost.
This patch solves this issue with RCU. Because kmsg_dump() only read the
list, no lock is needed in kmsg_dump(). So that kmsg_dump() will never
fail because of lock contention.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
sysctl.
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.
If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".
[akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
[akpm@linux-foundation.org: coding-style fixup]
[randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For arch which needs USE_GENERIC_SMP_HELPERS, it has to select
USE_GENERIC_SMP_HELPERS, rather than leaving a choice to user, since they
don't provide their own implementions.
Also, move on_each_cpu() to kernel/smp.c, it is strange to put it in
kernel/softirq.c.
For arch which doesn't use USE_GENERIC_SMP_HELPERS, e.g. blackfin, only
on_each_cpu() is compiled.
Signed-off-by: Amerigo Wang <amwang@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to know the reason why system rebooted in support service.
However, we can't inform our customers of the reason because final
messages are lost on current Linux kernel.
This patch improves the situation above because the final messages are
saved by adding kmsg_dump() to reboot, halt, poweroff and
emergency_restart path.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Reviewed-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <0D753D10438DA54287A00B027084269764CE0E54B7@AUSP01VMBX24.collaborationhost.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The minsec argument to clocks_calc_mult_shift() is misnamed. It is used
to clamp the magnitude of the mult factor so that a multiplication with
any value in the given range won't overflow a 64 bit result. Let's
rename it to match the actual usage.
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Acked-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <alpine.LFD.2.00.1101111207140.17086@xanadu.home>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Generic Hardware Error Source provides a way to report platform
hardware errors (such as that from chipset). It works in so called
"Firmware First" mode, that is, hardware errors are reported to
firmware firstly, then reported to Linux by firmware. This way, some
non-standard hardware error registers or non-standard hardware link
can be checked by firmware to produce more valuable hardware error
information for Linux.
This patch adds POLL/IRQ/NMI notification types support.
Because the memory area used to transfer hardware error information
from BIOS to Linux can be determined only in NMI, IRQ or timer
handler, but general ioremap can not be used in atomic context, so a
special version of atomic ioremap is implemented for that.
Known issue:
- Error information can not be printed for recoverable errors notified
via NMI, because printk is not NMI-safe. Will fix this via delay
printing to IRQ context via irq_work or make printk NMI-safe.
v2:
- adjust printk format per comments.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The nested NOT_RUNNING test in worker_clr_flags() is slightly
misleading in that if NOT_RUNNING were a single flag the nested test
would be always %true and thus noop. Add a comment noting that the
test isn't a noop.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Currently, the lockdep annotation in flush_work() requires exclusive
access on the workqueue the target work is queued on and triggers
warning if a work is trying to flush another work on the same
workqueue; however, this is no longer true as workqueues can now
execute multiple works concurrently.
This patch adds lock_map_acquire_read() and make process_one_work()
hold read access to the workqueue while executing a work and
start_flush_work() check for write access if concurrnecy level is one
or the workqueue has a rescuer (as only one execution resource - the
rescuer - is guaranteed to be available under memory pressure), and
read access if higher.
This better represents what's going on and removes spurious lockdep
warnings which are triggered by fake dependency chain created through
flush_work().
* Peter pointed out that flushing another work from a WQ_MEM_RECLAIM
wq breaks forward progress guarantee under memory pressure.
Condition check accordingly updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
The comment about why rt_mutex_next_owner() can return NULL in
wake_futex_pi() is not the normal case.
Tracing the cause of why this occurs is more likely that waiter
simply timedout. But because it originally caused contention with
the futex, the owner will go into the kernel when it unlocks
the lock. Then it will hit this code path and
rt_mutex_next_owner() will return NULL.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (30 commits)
MAINTAINERS: Add tomoyo-dev-en ML.
SELinux: define permissions for DCB netlink messages
encrypted-keys: style and other cleanup
encrypted-keys: verify datablob size before converting to binary
trusted-keys: kzalloc and other cleanup
trusted-keys: additional TSS return code and other error handling
syslog: check cap_syslog when dmesg_restrict
Smack: Transmute labels on specified directories
selinux: cache sidtab_context_to_sid results
SELinux: do not compute transition labels on mountpoint labeled filesystems
This patch adds a new security attribute to Smack called SMACK64EXEC. It defines label that is used while task is running.
SELinux: merge policydb_index_classes and policydb_index_others
selinux: convert part of the sym_val_to_name array to use flex_array
selinux: convert type_val_to_struct to flex_array
flex_array: fix flex_array_put_ptr macro to be valid C
SELinux: do not set automatic i_ino in selinuxfs
selinux: rework security_netlbl_secattr_to_sid
SELinux: standardize return code handling in selinuxfs.c
SELinux: standardize return code handling in selinuxfs.c
SELinux: standardize return code handling in policydb.c
...
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
mkuboot.sh: Fail if mkimage is missing
gen_init_cpio: checkpatch fixes
gen_init_cpio: Avoid race between call to stat() and call to open()
modpost: Fix address calculation in reloc_location()
Make fixdep error handling more explicit
checksyscalls: Fix stand-alone usage
modpost: Put .zdebug* section on white list
kbuild: fix interaction of CONFIG_IKCONFIG and KCONFIG_CONFIG
kbuild: export linux/{a.out,kvm,kvm_para}.h on headers_install_all
kbuild: introduce HDR_ARCH_LIST for headers_install_all
headers_install: check exit status of unifdef
gen_init_cpio: remove leading `/' from file names
scripts/genksyms: fix header usage
fixdep: use hash table instead of a single array
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
spi / PM: Support dev_pm_ops
PM: Prototype the pm_generic_ operations
PM / Runtime: Generic resume shouldn't set RPM_ACTIVE unconditionally
PM: Use dev_name() in core device suspend and resume routines
PM: Permit registration of parentless devices during system suspend
PM: Replace the device power.status field with a bit field
PM: Remove redundant checks from core device resume routines
PM: Use a different list of devices for each stage of device suspend
PM: Avoid compiler warning in pm_noirq_op()
PM: Use pm_wakeup_pending() in __device_suspend()
PM / Wakeup: Replace pm_check_wakeup_events() with pm_wakeup_pending()
PM: Prevent dpm_prepare() from returning errors unnecessarily
PM: Fix references to basic-pm-debugging.txt in drivers-testing.txt
PM / Runtime: Add synchronous runtime interface for interrupt handlers (v3)
PM / Hibernate: When failed, in_suspend should be reset
PM / Hibernate: hibernation_ops->leave should be checked too
Freezer: Fix a race during freezing of TASK_STOPPED tasks
PM: Use proper ccflag flag in kernel/power/Makefile
PM / Runtime: Fix comments to match runtime callback code
We normally just use the BIO_UPTODATE flag to signal 0/-EIO. If
we have more information available, we should pass that along to
the trace output.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Conflicts:
security/smack/smack_lsm.c
Verified and added fix by Stephen Rothwell <sfr@canb.auug.org.au>
Ok'd by Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: James Morris <jmorris@namei.org>
DEFINE_TRACE should also exist when CONFIG_EVENT_TRACING=n. Otherwise, setting
only TRACEPOINTS=y is broken.
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20101028153117.GA4051@Krystal>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While running my ftrace stress test, this showed up:
BUG: sleeping function called from invalid context at mm/mmap.c:233
...
note: cat[3293] exited with preempt_count 1
The bug was introduced by commit 91e86e560d
("tracing: Fix recursive user stack trace")
Cc: <stable@kernel.org>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4D0089AC.1020802@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
gameport: use this_cpu_read instead of lookup
x86: udelay: Use this_cpu_read to avoid address calculation
x86: Use this_cpu_inc_return for nmi counter
x86: Replace uses of current_cpu_data with this_cpu ops
x86: Use this_cpu_ops to optimize code
vmstat: User per cpu atomics to avoid interrupt disable / enable
irq_work: Use per cpu atomics instead of regular atomics
cpuops: Use cmpxchg for xchg to avoid lock semantics
x86: this_cpu_cmpxchg and this_cpu_xchg operations
percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
percpu,x86: relocate this_cpu_add_return() and friends
connector: Use this_cpu operations
xen: Use this_cpu_inc_return
taskstats: Use this_cpu_ops
random: Use this_cpu_inc_return
fs: Use this_cpu_inc_return in buffer.c
highmem: Use this_cpu_xx_return() operations
vmstat: Use this_cpu_inc_return for vm statistics
x86: Support for this_cpu_add, sub, dec, inc_return
percpu: Generic support for this_cpu_add, sub, dec, inc_return
...
Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
usb: don't use flush_scheduled_work()
speedtch: don't abuse struct delayed_work
media/video: don't use flush_scheduled_work()
media/video: explicitly flush request_module work
ioc4: use static work_struct for ioc4_load_modules()
init: don't call flush_scheduled_work() from do_initcalls()
s390: don't use flush_scheduled_work()
rtc: don't use flush_scheduled_work()
mmc: update workqueue usages
mfd: update workqueue usages
dvb: don't use flush_scheduled_work()
leds-wm8350: don't use flush_scheduled_work()
mISDN: don't use flush_scheduled_work()
macintosh/ams: don't use flush_scheduled_work()
vmwgfx: don't use flush_scheduled_work()
tpm: don't use flush_scheduled_work()
sonypi: don't use flush_scheduled_work()
hvsi: don't use flush_scheduled_work()
xen: don't use flush_scheduled_work()
gdrom: don't use flush_scheduled_work()
...
Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Constify function scope static struct sched_param usage
sched: Fix strncmp operation
sched: Move sched_autogroup_exit() to free_signal_struct()
sched: Fix struct autogroup memory leak
sched: Mark autogroup_init() __init
sched: Consolidate the name of root_task_group and init_task_group
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (36 commits)
serial: apbuart: Fixup apbuart_console_init()
TTY: Add tty ioctl to figure device node of the system console.
tty: add 'active' sysfs attribute to tty0 and console device
drivers: serial: apbuart: Handle OF failures gracefully
Serial: Avoid unbalanced IRQ wake disable during resume
tty: fix typos/errors in tty_driver.h comments
pch_uart : fix warnings for 64bit compile
8250: fix uninitialized FIFOs
ip2: fix compiler warning on ip2main_pci_tbl
specialix: fix compiler warning on specialix_pci_tbl
rocket: fix compiler warning on rocket_pci_ids
8250: add a UPIO_DWAPB32 for 32 bit accesses
8250: use container_of() instead of casting
serial: omap-serial: Add support for kernel debugger
serial: fix pch_uart kconfig & build
drivers: char: hvc: add arm JTAG DCC console support
RS485 documentation: add 16C950 UART description
serial: ifx6x60: fix memory leak
serial: ifx6x60: free IRQ on error
Serial: EG20T: add PCH_UART driver
...
Fixed up conflicts in drivers/serial/apbuart.c with evil merge that
makes the code look fairly sane (unlike either side).
Function-scope statics are discouraged because they are
easily overlooked and can cause subtle bugs/races due to
their global (non-SMP safe) nature.
Linus noticed that we did this for sched_param - at minimum
make the const.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: Message-ID: <AANLkTinotRxScOHEb0HgFgSpGPkq_6jKTv5CfvnQM=ee@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
One of the operands, buf, is incorrect, since it is stripped and the
correct address for subsequent string comparing could change if
leading white spaces, if any, are removed from buf.
It is fixed by replacing buf with cmp.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <AANLkTinOPuYsVovrZpbuCCmG5deEyc8WgA_A1RJx_YK7@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Seems I lost a change somewhere, leaking memory.
sched: fix struct autogroup memory leak
Add missing change to actually use autogroup_free().
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294222285.8369.2.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
autogroup_init() is only called at boot time.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294375425-31065-1-git-send-email-yong.zhang0@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
root_task_group is the leftover of USER_SCHED, now it's always
same to init_task_group.
But as Mike suggested, root_task_group is maybe the suitable name
to keep for a tree.
So in this patch:
init_task_group --> root_task_group
init_task_group_load --> root_task_group_load
INIT_TASK_GROUP_LOAD --> ROOT_TASK_GROUP_LOAD
Suggested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110107071736.GA32635@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds perf_event_time() to try and centralize access to event
timing and in particular ctx->time. Prepares for cgroup support.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4d22059c.122ae30a.5e0e.ffff8b8b@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace all occurrences of:
event->cpu != -1 && event->cpu == smp_processor_id()
by a call to:
event_filter_match(event)
This makes the code more consistent and will make the cgroup
patch smaller.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4d220593.2308e30a.48c5.ffff8ae9@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In particular this patch move perf_event_exit_task() before
cgroup_exit() to allow for cgroup support. The cgroup_exit()
function detaches the cgroups attached to a task.
Other movements include hoisting some definitions and inlines
at the top of perf_event.c
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4d22058b.cdace30a.4657.ffff95b1@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
blktrace.c block bio complete callback needs to gain a new argument to reflect
the newly added "error" tracepoint argument. This is needed to match the new
block_bio_complete TRACE_EVENT as of
commit de983a7bfcb7c020901ca6e2314cf55a4207ab5a.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Jeff Moyer <jmoyer@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix cgroupfs so as not to do this.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
The saving of the ACPI NVS area during hibernation and suspend and
restoring it during the subsequent resume is entirely specific to
ACPI, so move it to drivers/acpi and drop the CONFIG_SUSPEND_NVS
configuration option which is redundant.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
When ioremap() fails (which might happen for some reason), we nicely
oops in suspend_nvs_save() due to NULL dereference by memcpy() in there.
Fail gracefully instead.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (416 commits)
ARM: DMA: add support for DMA debugging
ARM: PL011: add DMA burst threshold support for ST variants
ARM: PL011: Add support for transmit DMA
ARM: PL011: Ensure IRQs are disabled in UART interrupt handler
ARM: PL011: Separate hardware FIFO size from TTY FIFO size
ARM: PL011: Allow better handling of vendor data
ARM: PL011: Ensure error flags are clear at startup
ARM: PL011: include revision number in boot-time port printk
ARM: vexpress: add sched_clock() for Versatile Express
ARM i.MX53: Make MX53 EVK bootable
ARM i.MX53: Some bug fix about MX53 MSL code
ARM: 6607/1: sa1100: Update platform device registration
ARM: 6606/1: sa1100: Fix platform device registration
ARM i.MX51: rename IPU irqs
ARM i.MX51: Add ipu clock support
ARM: imx/mx27_3ds: Add PMIC support
ARM: DMA: Replace page_to_dma()/dma_to_page() with pfn_to_dma()/dma_to_pfn()
mx51: fix usb clock support
MX51: Add support for usb host 2
arch/arm/plat-mxc/ehci.c: fix errors/typos
...
* 'x86-alternatives-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, suspend: Avoid unnecessary smp alternatives switch during suspend/resume
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, asm: Use fxsaveq/fxrestorq in more places
* 'x86-hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hwmon: Add core threshold notification to therm_throt.c
* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, paravirt: Use native_halt on a halt, not native_safe_halt
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
locking, lockdep: Convert sprintf_symbol to %pS
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq: Better struct irqaction layout
* 'x86-security-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
module: Move RO/NX module protection to after ftrace module update
x86: Resume trampoline must be executable
x86: Add RO/NX protection for loadable kernel modules
x86: Add NX protection for kernel data
x86: Fix improper large page preservation
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
MAINTAINERS: Update timer related entries
timers: Use this_cpu_read
timerqueue: Make timerqueue_getnext() static inline
hrtimer: fix timerqueue conversion flub
hrtimers: Convert hrtimers to use timerlist infrastructure
timers: Fixup allmodconfig build issue
timers: Rename timerlist infrastructure to timerqueue
timers: Introduce timerlist infrastructure.
hrtimer: Remove stale comment on curr_timer
timer: Warn when del_timer_sync() is called in hardirq context
timer: Del_timer_sync() can be used in softirq context
timer: Make try_to_del_timer_sync() the same on SMP and UP
posix-timers: Annotate lock_timer()
timer: Permit statically-declared work with deferrable timers
time: Use ARRAY_SIZE macro in timecompare.c
timer: Initialize the field slack of timer_list
timer_list: Remove alignment padding on 64 bit when CONFIG_TIMER_STATS
time: Compensate for rounding on odd-frequency clocksources
Fix up trivial conflict in MAINTAINERS
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
sched: Change wait_for_completion_*_timeout() to return a signed long
sched, autogroup: Fix reference leak
sched, autogroup: Fix potential access to freed memory
sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
sched: Move periodic share updates to entity_tick()
printk: Use this_cpu_{read|write} api on printk_pending
sched: Make pushable_tasks CONFIG_SMP dependant
sched: Add 'autogroup' scheduling feature: automated per session task groups
sched: Fix unregister_fair_sched_group()
sched: Remove unused argument dest_cpu to migrate_task()
mutexes, sched: Introduce arch_mutex_cpu_relax()
sched: Add some clock info to sched_debug
cpu: Remove incorrect BUG_ON
cpu: Remove unused variable
sched: Fix UP build breakage
sched: Make task dump print all 15 chars of proc comm
sched: Update tg->shares after cpu.shares write
sched: Allow update_cfs_load() to update global load
sched: Implement demand based update_cfs_load()
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (146 commits)
tools, perf: Documentation for the power events API
perf: Add calls to suspend trace point
perf script: Make some lists static
perf script: Use the default lost event handler
perf session: Warn about errors when processing pipe events too
perf tools: Fix perf_event.h header usage
perf test: Clarify some error reports in the open syscall test
x86, NMI: Add touch_nmi_watchdog to io_check_error delay
x86: Avoid calling arch_trigger_all_cpu_backtrace() at the same time
x86: Only call smp_processor_id in non-preempt cases
perf timechart: Adjust perf timechart to the new power events
perf: Clean up power events by introducing new, more generic ones
perf: Do not export power_frequency, but power_start event
perf test: Add test for counting open syscalls
perf evsel: Auto allocate resources needed for some methods
perf evsel: Use {cpu,thread}_map to shorten list of parameters
perf tools: Refactor all_tids to hold nr and the map
perf tools: Refactor cpumap to hold nr and the map
perf evsel: Introduce per cpu and per thread open helpers
perf evsel: Steal the counter reading routines from stat
...
* 'core-futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futex: Add futex_q static initializer
futex: Replace fshared and clockrt with combined flags
futex: Cleanup stale fshared flag interfaces
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: remove unused __list_for_each_rcu() macro
rculist: fix borked __list_for_each_rcu() macro
rcu: reduce __call_rcu()-induced contention on rcu_node structures
rcu: limit rcu_node leaf-level fanout
rcu: fine-tune grace-period begin/end checks
rcu: Keep gpnum and completed fields synchronized
rcu: Stop chasing QS if another CPU did it for us
rcu: increase synchronize_sched_expedited() batching
rcu: Make synchronize_srcu_expedited() fast if running readers
rcu: fix race condition in synchronize_sched_expedited()
rcu: update documentation/comments for Lai's adoption patch
rcu,cleanup: simplify the code when cpu is dying
rcu,cleanup: move synchronize_sched_expedited() out of sched.c
rcu: get rid of obsolete "classic" names in TREE_RCU tracing
rcu: Distinguish between boosting and boosted
rcu: document TINY_RCU and TINY_PREEMPT_RCU tracing.
rcu: add tracing for TINY_RCU and TINY_PREEMPT_RCU
rcu: priority boosting for TINY_PREEMPT_RCU
rcu: move TINY_RCU from softirq to kthread
rcu: add priority-inversion testing to rcutorture
Uses the machine_suspend trace point, called from the
generic kernel suspend_devices_and_enter function.
Signed-off-by: Jean Pihet <j-pihet@ti.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Arjan van de Ven <arjan@linux.intel.com>
CC: Thomas Renninger <trenn@suse.de>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-pm@lists.linux-foundation.org
LKML-Reference: <1294253342-29056-2-git-send-email-j-pihet@ti.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
wait_for_completion_*_timeout() can return:
0: if the wait timed out
-ve: if the wait was interrupted
+ve: if the completion was completed.
As they currently return an 'unsigned long', the last two cases
are not easily distinguished which can easily result in buggy
code, as is the case for the recently added
wait_for_completion_interruptible_timeout() call in
net/sunrpc/cache.c
So change them both to return 'long'. As MAX_SCHEDULE_TIMEOUT
is LONG_MAX, a large +ve return value should never overflow.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110105125016.64ccab0e@notabene.brown>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The cgroup exit mess also uncovered a struct autogroup reference leak.
copy_process() was simply freeing vs putting the signal_struct,
stranding a reference.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
LKML-Reference: <1293784350.6839.2.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Oleg pointed out that the /proc interface kref_get() useage may race with
the final put during autogroup_move_group(). A signal->autogroup assignment
may be in flight when the /proc interface dereference, leaving them taking
a reference to an already dead group.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1292508592.5940.28.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add these new power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend
The old C-state/idle accounting events:
power:power_start
power:power_end
Have now a replacement (but we are still keeping the old
tracepoints for compatibility):
power:cpu_idle
and
power:power_frequency
is replaced with:
power:cpu_frequency
power:machine_suspend is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.
the type= field got removed from both, it was never
used and the type is differed by the event type itself.
perf timechart userspace tool gets adjusted in a separate patch.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rjw@sisk.pl
LKML-Reference: <1294073445-14812-3-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
power_frequency moved to drivers/cpufreq/cpufreq.c which has
to be compiled in, no need to export it.
intel_idle can a be module though...
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Jean Pihet <j-pihet@ti.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: rjw@sisk.pl
LKML-Reference: <1294073445-14812-2-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
The error message 'NMI watchdog failed to create perf event...'
does not make it clear that this is a fatal error for the
watchdog. It also currently prints the error value as a
pointer, rather than extracting the error code with PTR_ERR().
Fix that.
Add a note to the description of the 'nowatchdog' kernel
parameter to associate it with this message.
Reported-by: Cesare Leonardi <celeonar@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 599368@bugs.debian.org
Cc: 608138@bugs.debian.org
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org> # .37.x and later
LKML-Reference: <1294009362.3167.126.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When racing on adding into user cache, the new allocated from mm slab
is freed without putting user namespace.
Since the user namespace is already operated by getting, putting has
to be issued.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ring_buffer: Off-by-one and duplicate events in ring_buffer_read_page
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: print out alloc information with KERN_DEBUG instead of KERN_INFO
kthread_work: make lockdep happy
To avoid confusion with the meaning and return value of
pm_check_wakeup_events() replace it with pm_wakeup_pending() that
will work the other way around (ie. return true when system-wide
power transition should be aborted).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
When hibernation failed due to an error in swsusp_write() called by
hibernate(), it skips calling "power_down()" and returns. When
hibernate() is called again (probably after fixing up so that
swsusp_write() wouldn't fail again), before "in_suspend = 1" of
create_image is called, in_suspend should be 0. However, because
hibernate() did not reset "in_suspend" after a failure, it's already 1.
This patch fixes such inconsistency of "in_suspend" value.
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Because hibernate calls hibernation_ops->leave() without checking
whether hibernation_ops->leave is NULL or not, hiberantion_set_ops
should WARN_ON if hibernation_ops->leave is NULL.
This patch added one more condition to check hibernation_ops->leave.
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
After calling freeze_task(), try_to_freeze_tasks() see whether the
task is stopped or traced and if so, considers it to be frozen;
however, nothing guarantees that either the task being frozen sees
TIF_FREEZE or the freezer sees TASK_STOPPED -> TASK_RUNNING
transition. The task being frozen may wake up and not see TIF_FREEZE
while the freezer fails to notice the transition and believes the task
is still stopped.
This patch fixes the race by making freeze_task() always go through
fake_signal_wake_up() for applicable tasks. The function goes through
the target task's scheduler lock and thus guarantees that either the
target sees TIF_FREEZE or try_to_freeze_task() sees TASK_RUNNING.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Use the ccflags-$ flag instead of EXTRA_CFLAGS because EXTRA_CFLAGS is
deprecated and should now be switched. According to
(documentation/kbuild/makefiles.txt).
Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Fix two related problems in the event-copying loop of
ring_buffer_read_page.
The loop condition for copying events is off-by-one.
"len" is the remaining space in the caller-supplied page.
"size" is the size of the next event (or two events).
If len == size, then there is just enough space for the next event.
size was set to rb_event_ts_length, which may include the size of two
events if the first event is a time-extend, in order to assure time-
extends are kept together with the event after it. However,
rb_advance_reader always advances by one event. This would result in the
event after any time-extend being duplicated. Instead, get the size of
a single event for the memcpy, but use rb_event_ts_length for the loop
condition.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1293064704-8101-1-git-send-email-dhsharp@google.com>
LKML-Reference: <AANLkTin7nLrRPc9qGjdjHbeVDDWiJjAiYyb-L=gH85bx@mail.gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit:
84e1c6bb38
x86: Add RO/NX protection for loadable kernel modules
Broke the function tracer with this output:
------------[ cut here ]------------
WARNING: at kernel/trace/ftrace.c:1014 ftrace_bug+0x114/0x171()
Hardware name: Precision WorkStation 470
Modules linked in: i2c_core(+)
Pid: 86, comm: modprobe Not tainted 2.6.37-rc2+ #68
Call Trace:
[<ffffffff8104e957>] warn_slowpath_common+0x85/0x9d
[<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
[<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
[<ffffffff8104e989>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810a9dfe>] ftrace_bug+0x114/0x171
[<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
[<ffffffff810aa0db>] ftrace_process_locs+0x1ae/0x274
[<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
[<ffffffff810aa29e>] ftrace_module_notify+0x39/0x44
[<ffffffff814405cf>] notifier_call_chain+0x37/0x63
[<ffffffff8106e054>] __blocking_notifier_call_chain+0x46/0x5b
[<ffffffff8106e07d>] blocking_notifier_call_chain+0x14/0x16
[<ffffffff8107ffde>] sys_init_module+0x73/0x1f3
[<ffffffff8100acf2>] system_call_fastpath+0x16/0x1b
---[ end trace 2aff4f4ca53ec746 ]---
ftrace faulted on writing [<ffffffffa00026db>]
__process_new_adapter+0x7/0x34 [i2c_core]
The cause was that the module text was set to read only before ftrace
could convert the calls to mcount to nops. Thus, the conversions failed
due to not being able to write to the text locations.
The simple fix is to move setting the module to read only after the
module notifiers are called (where ftrace sets the module mcounts to nops).
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The taskstats structure is internally aligned on 8 byte boundaries but the
layout of the aggregrate reply, with two NLA headers and the pid (each 4
bytes), actually force the entire structure to be unaligned. This causes
the kernel to issue unaligned access warnings on some architectures like
ia64. Unfortunately, some software out there doesn't properly unroll the
NLA packet and assumes that the start of the taskstats structure will
always be 20 bytes from the start of the netlink payload. Aligning the
start of the taskstats structure breaks this software, which we don't
want. So, for now the alignment only happens on architectures that
require it and those users will have to update to fixed versions of those
packages. Space is reserved in the packet only when needed. This ifdef
should be removed in several years e.g. 2012 once we can be confident
that fixed versions are installed on most systems. We add the padding
before the aggregate since the aggregate is already a defined type.
Commit 85893120 ("delayacct: align to 8 byte boundary on 64-bit systems")
previously addressed the alignment issues by padding out the pid field.
This was supposed to be a compatible change but the circumstances
described above mean that it wasn't. This patch backs out that change,
since it was a hack, and introduces a new NULL attribute type to provide
the padding. Padding the response with 4 bytes avoids allocating an
aligned taskstats structure and copying it back. Since the structure
weighs in at 328 bytes, it's too big to do it on the stack.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reported-by: Brian Rogers <brian@xyzw.org>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Guillaume Chazarain <guichaz@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Russell King reports:
| On the ARM dev boards, we have a 32-bit counter running at 24MHz. Calling
| clocks_calc_mult_shift(&mult, &shift, 24MHz, NSEC_PER_SEC, 60) gives
| us a multiplier of 2796202666 and a shift of 26.
|
| Over a large counter delta, this produces an error - lets take a count
| from 362976315 to 4280663372:
|
| (4280663372-362976315) * 2796202666 / 2^26 - (4280663372-362976315) * (1000/24)
| => -38.91872422891230269990
|
| Can we do better?
|
| (4280663372-362976315) * 2796202667 / 2^26 - (4280663372-362976315) * (1000/24)
| 19.45936211449532822051
|
| which is about twice as good as the 2796202666 multiplier.
|
| Looking at the equivalent divisions obtained, 2796202666 / 2^26 gives
| 41.66666665673255920410ns per tick, whereas 2796202667 / 2^26 gives
| 41.66666667163372039794ns. The actual value wanted is 1000/24 =
| 41.66666666666666666666ns.
Fix this by ensuring we round to nearest when calculating the
multiplier.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Tested-by: Eric Miao <eric.y.miao@gmail.com>
Tested-by: Olof Johansson <olof@lixom.net>
Tested-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
spinlock in kthread_worker and wait_queue_head in kthread_work both
should be lockdep sensible, so change the interface to make it
suiltable for CONFIG_LOCKDEP.
tj: comment update
Reported-by: Nicolas <nicolas.mailhot@laposte.net>
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Andy Walls <awalls@md.metrocast.net>
Tested-by: Andy Walls <awalls@md.metrocast.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, destroy_workqueue() makes the workqueue deny all new
queueing by setting WQ_DYING and flushes the workqueue once before
proceeding with destruction; however, there are cases where work items
queue more related work items. Currently, such users need to
explicitly flush the workqueue multiple times depending on the
possible depth of such chained queueing.
This patch updates the queueing path such that a work item can queue
further work items on the same workqueue even when WQ_DYING is set.
The flush on destruction is automatically retried until the workqueue
is empty. This guarantees that the workqueue is empty on destruction
while allowing chained queueing.
The flush retry logic whines if it takes too many retries to drain the
workqueue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Linus reported that the new warning introduced by commit f26f9aff6a
"Sched: fix skip_clock_update optimization" triggers. The need_resched
flag can be set by other CPUs asynchronously so this debug check is
bogus - remove it.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <AANLkTinJ8hAG1TpyC+CSYPR47p48+1=E7fiC45hMXT_1@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32: Make sure we can map all of lowmem if we need to
x86, vt-d: Handle previous faults after enabling fault handling
x86: Enable the intr-remap fault handling after local APIC setup
x86, vt-d: Fix the vt-d fault handling irq migration in the x2apic mode
x86, vt-d: Quirk for masking vtd spec errors to platform error handling logic
x86, xsave: Use alloc_bootmem_align() instead of alloc_bootmem()
bootmem: Add alloc_bootmem_align()
x86, gcc-4.6: Use gcc -m options when building vdso
x86: HPET: Chose a paranoid safe value for the ETIME check
x86: io_apic: Avoid unused variable warning when CONFIG_GENERIC_PENDING_IRQ=n
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Fix off by one in perf_swevent_init()
perf: Fix duplicate events with multiple-pmu vs software events
ftrace: Have recordmcount honor endianness in fn_ELF_R_INFO
scripts/tags.sh: Add magic for trace-events
tracing: Fix panic when lseek() called on "trace" opened for writing
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix the irqtime code for 32bit
sched: Fix the irqtime code to deal with u64 wraps
nohz: Fix get_next_timer_interrupt() vs cpu hotplug
Sched: fix skip_clock_update optimization
sched: Cure more NO_HZ load average woes
Mike Galbraith reported poor interactivity[*] when the new shares distribution
code was combined with autogroups.
The root cause turns out to be a mis-ordering of accounting accrued execution
time and shares updates. Since update_curr() is issued hierarchically,
updating the parent entity weights to reflect child enqueue/dequeue results in
the parent's unaccounted execution time then being accrued (vs vruntime) at the
new weight as opposed to the weight present at accumulation.
While this doesn't have much effect on processes with timeslices that cross a
tick, it is particularly problematic for an interactive process (e.g. Xorg)
which incurs many (tiny) timeslices. In this scenario almost all updates are
at dequeue which can result in significant fairness perturbation (especially if
it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
transitions).
Correct this by ensuring unaccounted time is accumulated prior to manipulating
an entity's weight.
[*] http://xkcd.com/619/ is perversely Nostradamian here.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20101216031038.159704378@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Long running entities that do not block (dequeue) require periodic updates to
maintain accurate share values. (Note: group entities with several threads are
quite likely to be non-blocking in many circumstances).
By virtue of being long-running however, we will see entity ticks (otherwise
the required update occurs in dequeue/put and we are done). Thus we can move
the detection (and associated work) for these updates into the periodic path.
This restores the 'atomicity' of update_curr() with respect to accounting.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101216031038.067028969@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
x86: avoid high BIOS area when allocating address space
x86: avoid E820 regions when allocating address space
x86: avoid low BIOS area when allocating address space
resources: add arch hook for preventing allocation in reserved areas
Revert "resources: support allocating space within a region from the top down"
Revert "PCI: allocate bus resources from the top down"
Revert "x86/PCI: allocate space from the end of a region, not the beginning"
Revert "x86: allocate space within a region top-down"
Revert "PCI: fix pci_bus_alloc_resource() hang, prefer positive decode"
PCI: Update MCP55 quirk to not affect non HyperTransport variants
The irq work queue is a per cpu object and it is sufficient for
synchronization if per cpu atomics are used. Doing so simplifies
the code and reduces the overhead of the code.
Before:
christoph@linux-2.6$ size kernel/irq_work.o
text data bss dec hex filename
451 8 1 460 1cc kernel/irq_work.o
After:
christoph@linux-2.6$ size kernel/irq_work.o
text data bss dec hex filename
438 8 1 447 1bf kernel/irq_work.o
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Christoph Lameter <cl@linux.com>
When the current __call_rcu() function was written, the expedited
APIs did not exist. The __call_rcu() implementation therefore went
to great lengths to detect the end of old grace periods and to start
new ones, all in the name of reducing grace-period latency. Now the
expedited APIs do exist, and the usage of __call_rcu() has increased
considerably. This commit therefore causes __call_rcu() to avoid
worrying about grace periods unless there are a large number of
RCU callbacks stacked up on the current CPU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Some recent benchmarks have indicated possible lock contention on the
leaf-level rcu_node locks. This commit therefore limits the number of
CPUs per leaf-level rcu_node structure to 16, in other words, there
can be at most 16 rcu_data structures fanning into a given rcu_node
structure. Prior to this, the limit was 32 on 32-bit systems and 64 on
64-bit systems.
Note that the fanout of non-leaf rcu_node structures is unchanged. The
organization of accesses to the rcu_node tree is such that references
to non-leaf rcu_node structures are much less frequent than to the
leaf structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Use the CPU's bit in rnp->qsmask to determine whether or not the CPU
should try to report a quiescent state. Handle overflow in the check
for rdp->gpnum having fallen behind.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When a CPU that was in an extended quiescent state wakes
up and catches up with grace periods that remote CPUs
completed on its behalf, we update the completed field
but not the gpnum that keeps a stale value of a backward
grace period ID.
Later, note_new_gpnum() will interpret the shift between
the local CPU and the node grace period ID as some new grace
period to handle and will then start to hunt quiescent state.
But if every grace periods have already been completed, this
interpretation becomes broken. And we'll be stuck in clusters
of spurious softirqs because rcu_report_qs_rdp() will make
this broken state run into infinite loop.
The solution, as suggested by Lai Jiangshan, is to ensure that
the gpnum and completed fields are well synchronized when we catch
up with completed grace periods on their behalf by other cpus.
This way we won't start noting spurious new grace periods.
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When a CPU is idle and others CPUs handled its extended
quiescent state to complete grace periods on its behalf,
it will catch up with completed grace periods numbers
when it wakes up.
But at this point there might be no more grace period to
complete, but still the woken CPU always keeps its stale
qs_pending value and will then continue to chase quiescent
states even if its not needed anymore.
This results in clusters of spurious softirqs until a new
real grace period is started. Because if we continue to
chase quiescent states but we have completed every grace
periods, rcu_report_qs_rdp() is puzzled and makes that
state run into infinite loops.
As suggested by Lai Jiangshan, just reset qs_pending if
someone completed every grace periods on our behalf.
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The fix in commit #6a0cc49 requires more than three concurrent instances
of synchronize_sched_expedited() before batching is possible. This
patch uses a ticket-counter-like approach that is also not unrelated to
Lai Jiangshan's Ring RCU to allow sharing of expedited grace periods even
when there are only two concurrent instances of synchronize_sched_expedited().
This commit builds on Tejun's original posting, which may be found at
http://lkml.org/lkml/2010/11/9/204, adding memory barriers, avoiding
overflow of signed integers (other than via atomic_t), and fixing the
detection of batching.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This adds arch_remove_reservations(), which an arch can implement if it
needs to protect part of the address space from allocation.
Sometimes that can be done by just putting a region in the resource tree,
but there are cases where that doesn't work well. For example, x86 BIOS
E820 reservations are not related to devices, so they may overlap part of,
all of, or more than a device resource, so they may not end up at the
correct spot in the resource tree.
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Use this_cpu_inc_return in one place and avoid ugly __raw_get_cpu in
another.
V3->V4:
- Fix off by one.
V4-V4f:
- Use &listener_array
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
__get_cpu_var() can be replaced with this_cpu_read and will then use a
single read instruction with implied address calculation to access the
correct per cpu instance.
However, the address of a per cpu variable passed to __this_cpu_read()
cannot be determined (since it's an implied address conversion through
segment prefixes). Therefore apply this only to uses of __get_cpu_var
where the address of the variable is not used.
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use this_cpu ops in various places to optimize per cpu data access.
Cc: Jason Baron <jbaron@redhat.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
tty: add 'active' sysfs attribute to tty0 and console device
Userspace can query the actual virtual console, and the configured
console devices behind /dev/tt0 and /dev/console.
The last entry in the list of devices is the active device, analog
to the console= kernel command line option.
The attribute supports poll(), which is raised when the virtual
console is changed or /dev/console is reconfigured.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
index 0000000..b138b66
Commit 3624eb0 (PM / Hibernate: Modify signature used to mark swap)
attempted to modify hibernate signature used to mark swap partitions
containing hibernation images, so that old kernels don't try to
handle compressed images. However, this change broke resume from
hibernation on Fedora 14 that apparently doesn't pass the resume=
argument to the kernel and tries to trigger resume from early user
space. This doesn't work, because the signature is now different,
so the old signature has to be restored to avoid the problem.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=22732 .
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Reported-by: Zhang Rui <rui.zhang@intel.com>
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The user-space hibernation sends a wrong notification after the image
restoration because of thinko for the file flag check. RDONLY
corresponds to hibernation and WRONLY to restoration, confusingly.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@kernel.org
Simple sysfs emumeration of the PMUs.
Use a "event_source" bus, and add PMU devices using their name.
Each PMU device has a type attribute which contrains the value needed
for perf_event_attr::type to identify this PMU.
This is the minimal stub needed to start using this interface,
we'll consider extending the sysfs usage later.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101117222056.316982569@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extend the perf_pmu_register() interface to allow for named and
dynamic pmu types.
Because we need to support the existing static types we cannot use
dynamic types for everything, hence provide a type argument.
If we want to enumerate the PMUs they need a name, provide one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101117222056.259707703@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently we call perf_event_init() from sched_init(). In order to
make it more obvious move it to the cannnonical location.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101117222056.093629821@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the irqtime accounting is using non-atomic u64 and can be read
from remote cpus (writes are strictly cpu local, reads are not) we
have to deal with observing partial updates.
When we do observe partial updates the clock movement (in particular,
->clock_task movement) will go funny (in either direction), a
subsequent clock update (observing the full update) will make it go
funny in the oposite direction.
Since we rely on these clocks to be strictly monotonic we cannot
suffer backwards motion. One possible solution would be to simply
ignore all backwards deltas, but that will lead to accounting
artefacts, most notable: clock_task + irq_time != clock, this
inaccuracy would end up in user visible stats.
Therefore serialize the reads using a seqcount.
Reviewed-by: Venkatesh Pallipadi <venki@google.com>
Reported-by: Mikael Pettersson <mikpe@it.uu.se>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1292242434.6803.200.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some ARM systems have a short sched_clock() [ which needs to be fixed
too ], but this exposed a bug in the irq_time code as well, it doesn't
deal with wraps at all.
Fix the irq_time code to deal with u64 wraps by re-writing the code to
only use delta increments, which avoids the whole issue.
Reviewed-by: Venkatesh Pallipadi <venki@google.com>
Reported-by: Mikael Pettersson <mikpe@it.uu.se>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1292242433.6803.199.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The perf_swevent_enabled[] array has PERF_COUNT_SW_MAX elements.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101024195041.GT5985@bicker>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: It is likely that WORKER_NOT_RUNNING is true
MAINTAINERS: Add workqueue entry
workqueue: check the allocation of system_unbound_wq
If you try to build a kernel with KCONFIG_CONFIG set (to a value
not equal to .config) and that config sets CONFIG_IKCONFIG then the
build will fail with:
make[1]: *** No rule to make target `.config', needed by \
`kernel/config_data.gz'. Stop.
because the kernel/Makefile contains a direct reference to .config.
This issue has been present since the introduction of KCONFIG_CONFIG
in 14cdd3c402.
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
CC: Roman Zippel <zippel@linux-m68k.org>
CC: Michal Marek <mmarek@suse.cz>
Reviewed-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Michal Marek <mmarek@suse.cz>
Running the annotate branch profiler on three boxes, including my
main box that runs firefox, evolution, xchat, and is part of the distcc farm,
showed this with the likelys in the workqueue code:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
96 996253 99 wq_worker_sleeping workqueue.c 703
96 996247 99 wq_worker_waking_up workqueue.c 677
The likely()s in this case were assuming that WORKER_NOT_RUNNING will
most likely be false. But this is not the case. The reason is
(and shown by adding trace_printks and testing it) that most of the time
WORKER_PREP is set.
In worker_thread() we have:
worker_clr_flags(worker, WORKER_PREP);
[ do work stuff ]
worker_set_flags(worker, WORKER_PREP, false);
(that 'false' means not to wake up an idle worker)
The wq_worker_sleeping() is called from schedule when a worker thread
is putting itself to sleep. Which happens most of the time outside
of that [ do work stuff ].
The wq_worker_waking_up is called by the wakeup worker code, which
is also callod outside that [ do work stuff ].
Thus, the likely and unlikely used by those two functions are actually
backwards.
Remove the annotation and let gcc figure it out.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
During suspend, we disable all the non boot cpus. And during resume we bring
them all back again. So no need to do alternatives_smp_switch() in between.
On my core 2 based laptop, this speeds up the suspend path by 15msec and the
resume path by 5 msec (suspend/resume speed up differences can be attributed
to the different P-states that the cpu is in during suspend/resume).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <1290557500.4946.8.camel@sbsiddha-MOBL3.sc.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Eric asked for this.
[tglx: Because it generates faster code according to Erics ]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-mm@kvack.org
LKML-Reference: <alpine.DEB.2.00.1011301404490.4039@router.home>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In converting the hrtimers to timerqueue, I missed
a spot in hrtimer_run_queues where we loop running
timers. We end up not pulling the new next value out
and instead just use the last next value, causing
boot time hangs in some cases.
The proper fix is to pull timerqueue_getnext each iteration
instead of using a local next value.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Converts the hrtimer code to use the new timerlist infrastructure
Signed-off-by: John Stultz <john.stultz@linaro.org>
LKML Reference: <1290136329-18291-3-git-send-email-john.stultz@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
CC: Alessandro Zummo <a.zummo@towertech.it>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Richard Cochran <richardcochran@gmail.com>
Originally adapted from Huang Ying's patch which moved the
unknown_nmi_panic to the traps.c file. Because the old nmi
watchdog was deleted before this change happened, the
unknown_nmi_panic sysctl was lost. This re-adds it.
Also, the nmi_watchdog sysctl was re-implemented and its
documentation updated accordingly.
Patch-inspired-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: fweisbec@gmail.com
LKML-Reference: <1291068437-5331-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Eric Paris pointed out that it doesn't make sense to require
both CAP_SYS_ADMIN and CAP_SYSLOG for certain syslog actions.
So require CAP_SYSLOG, not CAP_SYS_ADMIN, when dmesg_restrict
is set.
(I'm also consolidating the now common error path)
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Eric Paris <eparis@redhat.com>
Acked-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
Use the reboot notifier to detach all running counters on reboot, this
solves a problem with kexec where the new kernel doesn't expect
running counters (rightly so).
It will however decrease the coverage of the NMI watchdog. Making a
kexec specific reboot notifier callback would be best, however that
would require touching all notifier callback handlers as they are not
properly structured to deal with new state.
As a compromise, place the perf reboot notifier at the very last
position in the list.
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Don Zickus <dzickus@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
__get_cpu_var() is a bit inefficient, lets use __this_cpu_read() and
__this_cpu_write() to manipulate printk_pending.
printk_needs_cpu(cpu) is called only for the current cpu :
Use faster __this_cpu_read().
Remove the redundant unlikely on (cpu_is_offline(cpu)) test:
# size kernel/printk.o*
text data bss dec hex filename
9942 756 263488 274186 42f0a kernel/printk.o.new
9990 756 263488 274234 42f3a kernel/printk.o.old
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290788536.2855.237.camel@edumazet-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As noted by Peter Zijlstra at https://lkml.org/lkml/2010/11/10/391
(while reviewing other stuff, though), tracking pushable tasks
only makes sense on SMP systems.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1291143093.2697.298.camel@Palantir>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This fixes a bug as seen on 2.6.32 based kernels where timers got
enqueued on offline cpus.
If a cpu goes offline it might still have pending timers. These will
be migrated during CPU_DEAD handling after the cpu is offline.
However while the cpu is going offline it will schedule the idle task
which will then call tick_nohz_stop_sched_tick().
That function in turn will call get_next_timer_intterupt() to figure
out if the tick of the cpu can be stopped or not. If it turns out that
the next tick is just one jiffy off (delta_jiffies == 1)
tick_nohz_stop_sched_tick() incorrectly assumes that the tick should
not stop and takes an early exit and thus it won't update the load
balancer cpu.
Just afterwards the cpu will be killed and the load balancer cpu could
be the offline cpu.
On 2.6.32 based kernel get_nohz_load_balancer() gets called to decide
on which cpu a timer should be enqueued (see __mod_timer()). Which
leads to the possibility that timers get enqueued on an offline cpu.
These will never expire and can cause a system hang.
This has been observed 2.6.32 kernels. On current kernels
__mod_timer() uses get_nohz_timer_target() which doesn't have that
problem. However there might be other problems because of the too
early exit tick_nohz_stop_sched_tick() in case a cpu goes offline.
The easiest and probably safest fix seems to be to let
get_next_timer_interrupt() just lie and let it say there isn't any
pending timer if the current cpu is offline.
I also thought of moving migrate_[hr]timers() from CPU_DEAD to
CPU_DYING, but seeing that there already have been fixes at least in
the hrtimer code in this area I'm afraid that this could add new
subtle bugs.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101201091109.GA8984@osiris.boeblingen.de.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched(). Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.
Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set. Make the optimization robust against the waking
a sleeper before it sucessfully deschedules case by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear unconditionally in schedule() proper instead of conditionally
in put_prev_task().
Signed-off-by: Mike Galbraith <efault@gmx.de>
Reported-by: Bjoern B. Brandenburg <bbb.lst@gmail.com>
Tested-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
LKML-Reference: <1291802742.1417.9.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's a long-running regression that proved difficult to fix and
which is hitting certain people and is rather annoying in its effects.
Damien reported that after 74f5187ac8 (sched: Cure load average vs
NO_HZ woes) his load average is unnaturally high, he also noted that
even with that patch reverted the load avgerage numbers are not
correct.
The problem is that the previous patch only solved half the NO_HZ
problem, it addressed the part of going into NO_HZ mode, not of
comming out of NO_HZ mode. This patch implements that missing half.
When comming out of NO_HZ mode there are two important things to take
care of:
- Folding the pending idle delta into the global active count.
- Correctly aging the averages for the idle-duration.
So with this patch the NO_HZ interaction should be complete and
behaviour between CONFIG_NO_HZ=[yn] should be equivalent.
Furthermore, this patch slightly changes the load average computation
by adding a rounding term to the fixed point multiplication.
Reported-by: Damien Wyart <damien.wyart@free.fr>
Reported-by: Tim McGrath <tmhikaru@gmail.com>
Tested-by: Damien Wyart <damien.wyart@free.fr>
Tested-by: Orion Poplawski <orion@cora.nwra.com>
Tested-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Cc: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1291129145.32004.874.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because the multi-pmu bits can share contexts between struct pmu
instances we could get duplicate events by iterating the pmu list.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/pvclock: Zero last_value on resume
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf record: Fix eternal wait for stillborn child
perf header: Don't assume there's no attr info if no sample ids is provided
perf symbols: Figure out start address of kernel map from kallsyms
perf symbols: Fix kallsyms kernel/module map splitting
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
nohz: Fix printk_needs_cpu() return value on offline cpus
printk: Fix wake_up_klogd() vs cpu hotplug
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM / Hibernate: Fix memory corruption related to swap
PM / Hibernate: Use async I/O when reading compressed hibernation image
There is a problem that swap pages allocated before the creation of
a hibernation image can be released and used for storing the contents
of different memory pages while the image is being saved. Since the
kernel stored in the image doesn't know of that, it causes memory
corruption to occur after resume from hibernation, especially on
systems with relatively small RAM that need to swap often.
This issue can be addressed by keeping the GFP_IOFS bits clear
in gfp_allowed_mask during the entire hibernation, including the
saving of the image, until the system is finally turned off or
the hibernation is aborted. Unfortunately, for this purpose
it's necessary to rework the way in which the hibernate and
suspend code manipulates gfp_allowed_mask.
This change is based on an earlier patch from Hugh Dickins.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: stable@kernel.org
This is a fix for reading LZO compressed image using async I/O.
Essentially, instead of having just one page into which we keep
reading blocks from swap, we allocate enough of them to cover the
largest compressed size and then let block I/O pick them all up. Once
we have them all (and here we wait), we decompress them, as usual.
Obviously, the very first block we still pick up synchronously,
because we need to know the size of the lot before we pick up the
rest.
Also fixed the copyright line, which I've forgotten before.
Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Use text_poke_smp_batch() on unoptimization path for reducing
the number of stop_machine() issues. If the number of
unoptimizing probes is more than MAX_OPTIMIZE_PROBES(=256),
kprobes unoptimizes first MAX_OPTIMIZE_PROBES probes and kicks
optimizer for remaining probes.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20101203095434.2961.22657.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use text_poke_smp_batch() in optimization path for reducing
the number of stop_machine() issues. If the number of optimizing
probes is more than MAX_OPTIMIZE_PROBES(=256), kprobes optimizes
first MAX_OPTIMIZE_PROBES probes and kicks optimizer for
remaining probes.
Changes in v5:
- Use kick_kprobe_optimizer() instead of directly calling
schedule_delayed_work().
- Rescheduling optimizer outside of kprobe mutex lock.
Changes in v2:
- Allocate code buffer and parameters in arch_init_kprobes()
instead of using static arraies.
- Merge previous max optimization limit patch into this patch.
So, this patch introduces upper limit of optimization at
once.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20101203095428.2961.8994.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reuse unused (waiting for unoptimizing and no user handler)
kprobe on given address instead of returning -EBUSY for
registering a new kprobe.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20101203095416.2961.39080.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unoptimization occurs when a probe is unregistered or disabled,
and is heavy because it recovers instructions by using
stop_machine(). This patch delays unoptimization operations and
unoptimize several probes at once by using
text_poke_smp_batch(). This can avoid unexpected system slowdown
coming from stop_machine().
Changes in v5:
- Split this patch into several cleanup patches and this patch.
- Fix some text_mutex lock miss.
- Use bool instead of int for behavior flags.
- Add additional comment for (un)optimizing path.
Changes in v2:
- Use dynamic allocated buffers and params.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20101203095409.2961.82733.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Separate kprobe optimizing code from optimizer, this
will make easy to introducing unoptimizing code in
optimizer.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20101203095403.2961.91201.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge disabling kprobe to unregistering kprobe function
and add comments for disabing/unregistring process.
Current unregistering code disables(disarms) kprobes after
checking target kprobe status. This patch changes it to
disabling kprobe first after that it changing the kprobe's
state. This allows to share probe disabling code between
disable_kprobe() and unregister_kprobe().
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20101203095356.2961.30152.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rename irrelevant uses of "old_p" to more appropriate names.
Originally, "old_p" just meant "the old kprobe on given address"
but current code uses that name as "just another kprobe" or
something like that. This patch renames those pointer names
to more appropriate one for maintainability.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20101203095350.2961.48110.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If perf_event_attr.sample_id_all is set it will add the PERF_SAMPLE_ identity
info:
TID, TIME, ID, CPU, STREAM_ID
As a trailer, so that older perf tools can process new files, just ignoring the
extra payload.
With this its possible to do further analysis on problems in the event stream,
like detecting reordering of MMAP and FORK events, etc.
V2: Fixup header size in comm, mmap and task processing, as we have to take into
account different sample_types for each matching event, noticed by Thomas Gleixner.
Thomas also noticed a problem in v2 where if we didn't had space in the buffer we
wouldn't restore the header size.
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: Ian Munsie <imunsie@au1.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Those will be made available in sample like events like MMAP, EXEC, etc in a
followup patch. So precalculate the extra id header space and have a separate
routine to fill them up.
V2: Thomas noticed that the id header needs to be precalculated at
inherit_events too:
LKML-Reference: <alpine.LFD.2.00.1012031245220.2653@localhost6.localdomain6>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: Ian Munsie <imunsie@au1.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <1291318772-30880-2-git-send-email-acme@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The precalculated header size is not updated when an event is inherited. That
results in bogus sample entries for all child events. Bug introduced in c320c7b.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ian Munsie <imunsie@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <alpine.LFD.2.00.1012031245220.2653@localhost6.localdomain6>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
otherwise reset before do_exit(). do_exit may later (via mm_release in
fork.c) do a put_user to a user-controlled address, potentially allowing
a user to leverage an oops into a controlled write into kernel memory.
This is only triggerable in the presence of another bug, but this
potentially turns a lot of DoS bugs into privilege escalations, so it's
worth fixing. I have proof-of-concept code which uses this bug along
with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
I've tested that this is not theoretical.
A more logical place to put this fix might be when we know an oops has
occurred, before we call do_exit(), but that would involve changing
every architecture, in multiple places.
Let's just stick it in do_exit instead.
[akpm@linux-foundation.org: update code comment]
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit a1afb637(switch /proc/irq/*/spurious to seq_file) all
/proc/irq/XX/spurious files show the information of irq 0.
Current irq_spurious_proc_open() passes on NULL as the 3rd argument,
which is used as an IRQ number in irq_spurious_proc_show(), to the
single_open(). Because of this, all the /proc/irq/XX/spurious file
shows IRQ 0 information regardless of the IRQ number.
To fix the problem, irq_spurious_proc_open() must pass on the
appropreate data (IRQ number) to single_open().
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
LKML-Reference: <4CF4B778.90604@jp.fujitsu.com>
Cc: stable@kernel.org [2.6.33+]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
PERF_SAMPLE_{CALLCHAIN,RAW} have variable lenghts per sample, but the others
can be precalculated, reducing a bit the per sample cost.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: Ian Munsie <imunsie@au1.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <new-submission>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The file_ops struct for the "trace" special file defined llseek as seq_lseek().
However, if the file was opened for writing only, seq_open() was not called,
and the seek would dereference a null pointer, file->private_data.
This patch introduces a new wrapper for seq_lseek() which checks if the file
descriptor is opened for reading first. If not, it does nothing.
Cc: <stable@kernel.org>
Signed-off-by: Slava Pestov <slavapestov@google.com>
LKML-Reference: <1290640396-24179-1-git-send-email-slavapestov@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.
Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the preveious task group is dropped. Child
processes inherit this task group thereafter, and increase it's
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.
At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem:
cat /proc/<pid>/autogroup
Displays the task's group and the group's nice level.
echo <nice level> > /proc/<pid>/autogroup
Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the
abuse risk of task group locking.
The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:
echo [01] > /proc/sys/kernel/sched_autogroup_enabled
... which will automatically move tasks to/from the root task group.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Paul Turner <pjt@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the flipping and flopping between calling
unregister_fair_sched_group() on a per-cpu versus per-group basis
we ended up in a bad state.
Remove from the list for the passed cpu as opposed to some
arbitrary index.
( This fixes explosions w/ autogroup as well as a group
creation/destruction stress test. )
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20101130005740.080828123@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The synchronize_srcu_expedited() function is currently quick if there
are no active readers, but will delay a full jiffy if there are any.
If these readers leave their SRCU read-side critical sections quickly,
this is way too long to wait. So this commit first waits ten microseconds,
and only then falls back to jiffy-at-a-time waiting.
Reported-by: Avi Kivity <avi@redhat.com>
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The new (early 2010) implementation of synchronize_sched_expedited() uses
try_stop_cpu() to force a context switch on every CPU. It also permits
concurrent calls to synchronize_sched_expedited() to share a single call
to try_stop_cpu() through use of an atomically incremented
synchronize_sched_expedited_count variable. Unfortunately, this is
subject to failure as follows:
o Task A invokes synchronize_sched_expedited(), try_stop_cpus()
succeeds, but Task A is preempted before getting to the atomic
increment of synchronize_sched_expedited_count.
o Task B also invokes synchronize_sched_expedited(), with exactly
the same outcome as Task A.
o Task C also invokes synchronize_sched_expedited(), again with
exactly the same outcome as Tasks A and B.
o Task D also invokes synchronize_sched_expedited(), but only
gets as far as acquiring the mutex within try_stop_cpus()
before being preempted, interrupted, or otherwise delayed.
o Task E also invokes synchronize_sched_expedited(), but only
gets to the snapshotting of synchronize_sched_expedited_count.
o Tasks A, B, and C all increment synchronize_sched_expedited_count.
o Task E fails to get the mutex, so checks the new value
of synchronize_sched_expedited_count. It finds that the
value has increased, so (wrongly) assumes that its work
has been done, returning despite there having been no
expedited grace period since it began.
The solution is to have the lowest-numbered CPU atomically increment
the synchronize_sched_expedited_count variable within the
synchronize_sched_expedited_cpu_stop() function, which is under
the protection of the mutex acquired by try_stop_cpus(). However, this
also requires that piggybacking tasks wait for three rather than two
instances of try_stop_cpu(), because we cannot control the order in
which the per-CPU callback function occur.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Lai's RCU-callback immediate-adoption patch changes the RCU tracing
output, so update tracing.txt. Also update a few comments to clarify
the synchronization design.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When we handle the CPU_DYING notifier, the whole system is stopped except
for the current CPU. We therefore need no synchronization with the other
CPUs. This allows us to move any orphaned RCU callbacks directly to the
list of any online CPU without needing to run them through the global
orphan lists. These global orphan lists can therefore be dispensed with.
This commit makes thes changes, though currently victimizes CPU 0 @@@.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The first version of synchronize_sched_expedited() used the migration
code in the scheduler, and was therefore implemented in kernel/sched.c.
However, the more recent version of this code no longer uses the
migration code, so this commit moves it to the main RCU source files.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The TREE_RCU tracing had obsolete rcuclassic_trace_init() and
rcuclassic_trace_cleanup() function names. This commit brings them
up to date: rcutree_trace_init() and rcutree_trace_cleanup(),
respectively.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU priority boosting's tracing did not distinguish between ongoing
boosting and completion of boosting. This commit therefore adds this
capability.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add tracing for the tiny RCU implementations, including statistics on
boosting in the case of TINY_PREEMPT_RCU and RCU_BOOST.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add priority boosting, but only for TINY_PREEMPT_RCU. This is enabled
by the default-off RCU_BOOST kernel parameter. The priority to which to
boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel
parameter (defaulting to real-time priority 1) and the time to wait
before boosting the readers blocking a given grace period is controlled
by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds).
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
These warnings are spewed during a build of a 'allnoconfig' kernel
(especially the ones from u64_stats_sync.h show up a lot) when building
with -Wextra (which I often do)..
They are
a) annoying
b) easy to get rid of.
This patch kills them off.
include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Privileged syslog operations currently require CAP_SYS_ADMIN. Split
this off into a new CAP_SYSLOG privilege which we can sanely take away
from a container through the capability bounding set.
With this patch, an lxc container can be prevented from messing with
the host's syslog (i.e. dmesg -c).
Changelog: mar 12 2010: add selinux capability2:cap_syslog perm
Changelog: nov 22 2010:
. port to new kernel
. add a WARN_ONCE if userspace isn't using CAP_SYSLOG
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-By: Kees Cook <kees.cook@canonical.com>
Cc: James Morris <jmorris@namei.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: "Christopher J. PeBenito" <cpebenito@tresys.com>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: James Morris <jmorris@namei.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf symbols: Remove incorrect open-coded container_of()
perf record: Handle restrictive permissions in /proc/{kallsyms,modules}
x86/kprobes: Prevent kprobes to probe on save_args()
irq_work: Drop cmpxchg() result
perf: Fix owner-list vs exit
x86, hw_nmi: Move backtrace_mask declaration under ARCH_HAS_NMI_WATCHDOG
tracing: Fix recursive user stack trace
perf,hw_breakpoint: Initialize hardware api earlier
x86: Ignore trap bits on single step exceptions
tracing: Force arch_local_irq_* notrace for paravirt
tracing: Fix module use of trace_bprintk()
The perf hardware pmu got initialized at various points in the boot,
some before early_initcall() some after (notably arch_initcall).
The problem is that the NMI lockup detector is ran from early_initcall()
and expects the hardware pmu to be present.
Sanitize this by moving all architecture hardware pmu implementations to
initialize at early_initcall() and move the lockup detector to an explicit
initcall right after that.
Cc: paulus <paulus@samba.org>
Cc: davem <davem@davemloft.net>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290707759.2145.119.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
and use it when appropriate.
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290525705-6265-1-git-send-email-fbuihuu@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove unused argument, 'dest_cpu' of migrate_task(), and pass runqueue,
as it is always known at the call site.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <201011261237.09187.knikanth@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The spinning mutex implementation uses cpu_relax() in busy loops as a
compiler barrier. Depending on the architecture, cpu_relax() may do more
than needed in this specific mutex spin loops. On System z we also give
up the time slice of the virtual cpu in cpu_relax(), which prevents
effective spinning on the mutex.
This patch replaces cpu_relax() in the spinning mutex code with
arch_mutex_cpu_relax(), which can be defined by each architecture that
selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so
this patch should not affect other architectures than System z for now.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290437256.7455.4.camel@thinkpad>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes a hang observed with 2.6.32 kernels where timers got enqueued
on offline cpus.
printk_needs_cpu() may return 1 if called on offline cpus. When a cpu gets
offlined it schedules the idle process which, before killing its own cpu, will
call tick_nohz_stop_sched_tick(). That function in turn will call
printk_needs_cpu() in order to check if the local tick can be disabled. On
offline cpus this function should naturally return 0 since regardless if the
tick gets disabled or not the cpu will be dead short after. That is besides the
fact that __cpu_disable() should already have made sure that no interrupts on
the offlined cpu will be delivered anyway.
In this case it prevents tick_nohz_stop_sched_tick() to call
select_nohz_load_balancer(). No idea if that really is a problem. However what
made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is
used within __mod_timer() to select a cpu on which a timer gets enqueued. If
printk_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get
updated when a cpu gets offlined. It may contain the cpu number of an offline
cpu. In turn timers get enqueued on an offline cpu and not very surprisingly
they never expire and cause system hangs.
This has been observed 2.6.32 kernels. On current kernels __mod_timer() uses
get_nohz_timer_target() which doesn't have that problem. However there might be
other problems because of the too early exit tick_nohz_stop_sched_tick() in
case a cpu goes offline.
Easiest way to fix this is just to test if the current cpu is offline and call
printk_tick() directly which clears the condition.
Alternatively I tried a cpu hotplug notifier which would clear the condition,
however between calling the notifier function and printk_needs_cpu() something
could have called printk() again and the problem is back again. This seems to
be the safest fix.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
LKML-Reference: <20101126120235.406766476@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
wake_up_klogd() may get called from preemptible context but uses
__raw_get_cpu_var() to write to a per cpu variable. If it gets preempted
between getting the address and writing to it, the cpu in question could be
offline if the process gets scheduled back and hence writes to the per cpu data
of an offline cpu.
This buggy behaviour was introduced with fa33507a "printk: robustify
printk, fix#2" which was supposed to fix a "using smp_processor_id() in
preemptible" warning.
Let's use this_cpu_write() instead which disables preemption and makes sure
that the outlined scenario cannot happen.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101126124247.GC7023@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stephane noticed that because the perf_sw_event() call is inside the
perf_event_task_sched_out() call it won't get called unless we
have a per-task counter.
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It was found that sometimes children of tasks with inherited events had
one extra event. Eventually it turned out to be due to the list rotation
no being exclusive with the list iteration in the inheritance code.
Cure this by temporarily disabling the rotation while we inherit the events.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I found a trivial bug on initialization of workqueue.
Current init_workqueues doesn't check the result of
allocation of system_unbound_wq, this should be checked
like other queues.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add more clock information to /proc/sched_debug, Thomas wanted to see
the sched_clock_stable state.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Oleg mentioned that there is no actual guarantee the dying cpu's
migration thread is actually finished running when we get there, so
replace the BUG_ON() with a spinloop waiting for it.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
GCC warns us about:
kernel/cpu.c: In function ‘take_cpu_down’:
kernel/cpu.c:200:15: warning: unused variable ‘cpu’
This variable is unused since param->hcpu is directly
used later on in cpu_notify.
Signed-off-by: Dhaval Giani <dhaval_giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290091494.1145.5.camel@gondor.retis>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The recent cgroup-scheduling rework caused a UP build problem.
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 59365d136d.
It turns out that this can break certain existing user land setups.
Quoth Sarah Sharp:
"On Wednesday, I updated my branch to commit 460781b from linus' tree,
and my box would not boot. klogd segfaulted, which stalled the whole
system.
At first I thought it actually hung the box, but it continued booting
after 5 minutes, and I was able to log in. It dropped back to the
text console instead of the graphical bootup display for that period
of time. dmesg surprisingly still works. I've bisected the problem
down to this commit (commit 59365d136d)
The box is running klogd 1.5.5ubuntu3 (from Jaunty). Yes, I know
that's old. I read the bit in the commit about changing the
permissions of kallsyms after boot, but if I can't boot that doesn't
help."
So let's just keep the old default, and encourage distributions to do
the "chmod -r /proc/kallsyms" in their bootup scripts. This is not
worth a kernel option to change default behavior, since it's so easily
done in user space.
Reported-and-bisected-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Cc: Marcus Meissner <meissner@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have in something like the sched_switch event:
field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
When a userspace tool such as perf tries to parse this, the
TASK_COMM_LEN is meaningless. This is done because the TRACE_EVENT() macro
simply uses a #len to show the string of the length. When the length is
an enum, we get a string that means nothing for tools.
By adding a static buffer and a mutex to protect it, we can store the
string into that buffer with snprintf and show the actual number.
Now we get:
field:char prev_comm[16]; offset:12; size:16; signed:1;
Something much more useful.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdb,ppc: Fix regression in evr register handling
kgdb,x86: fix regression in detach handling
kdb: fix crash when KDB_BASE_CMD_MAX is exceeded
kdb: fix memory leak in kdb_main.c
This adds a new trace event internal flag that allows them to be
used in perf by non privileged users in case of task bound tracing.
This is desired for syscalls tracepoint because they don't leak
global system informations, like some other tracepoints.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jason Baron <jbaron@redhat.com>
This patch is a logical extension of the protection provided by
CONFIG_DEBUG_RODATA to LKMs. The protection is provided by
splitting module_core and module_init into three logical parts
each and setting appropriate page access permissions for each
individual section:
1. Code: RO+X
2. RO data: RO+NX
3. RW data: RW+NX
In order to achieve proper protection, layout_sections() have
been modified to align each of the three parts mentioned above
onto page boundary. Next, the corresponding page access
permissions are set right before successful exit from
load_module(). Further, free_module() and sys_init_module have
been modified to set module_core and module_init as RW+NX right
before calling module_free().
By default, the original section layout and access flags are
preserved. When compiled with CONFIG_DEBUG_SET_MODULE_RONX=y,
the patch will page-align each group of sections to ensure that
each page contains only one type of content and will enforce
RO/NX for each group of pages.
-v1: Initial proof-of-concept patch.
-v2: The patch have been re-written to reduce the number of #ifdefs
and to make it architecture-agnostic. Code formatting has also
been corrected.
-v3: Opportunistic RO/NX protection is now unconditional. Section
page-alignment is enabled when CONFIG_DEBUG_RODATA=y.
-v4: Removed most macros and improved coding style.
-v5: Changed page-alignment and RO/NX section size calculation
-v6: Fixed comments. Restricted RO/NX enforcement to x86 only
-v7: Introduced CONFIG_DEBUG_SET_MODULE_RONX, added
calls to set_all_modules_text_rw() and set_all_modules_text_ro()
in ftrace
-v8: updated for compatibility with linux 2.6.33-rc5
-v9: coding style fixes
-v10: more coding style fixes
-v11: minor adjustments for -tip
-v12: minor adjustments for v2.6.35-rc2-tip
-v13: minor adjustments for v2.6.37-rc1-tip
Signed-off-by: Siarhei Liakh <sliakh.lkml@gmail.com>
Signed-off-by: Xuxian Jiang <jiang@cs.ncsu.edu>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dave Jones <davej@redhat.com>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4CE2F914.9070106@free.fr>
[ minor cleanliness edits, -v14: build failure fix ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Formerly sched_group_set_shares would force a rebalance by overflowing domain
share sums. Now that per-cpu averages are maintained we can set the true value
by issuing an update_cfs_shares() following a tg->shares update.
Also initialize tg se->load to 0 for consistency since we'll now set correct
weights on enqueue.
Signed-off-by: Paul Turner <pjt@google.com?>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.465521344@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Refactor the global load updates from update_shares_cpu() so that
update_cfs_load() can update global load when it is more than ~10%
out of sync.
The new global_load parameter allows us to force an update, regardless of
the error factor so that we can synchronize w/ update_shares().
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.377473595@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the system is busy, dilation of rq->next_balance makes lb->update_shares()
insufficiently frequent for threads which don't sleep (no dequeue/enqueue
updates). Adjust for this by making demand based updates based on the
accumulation of execution time sufficient to wrap our averaging window.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.291159744@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since shares updates are no longer expensive and effectively local, update them
at idle_balance(). This allows us to more quickly redistribute shares to
another cpu when our load becomes idle.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.204191702@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a new sysctl for the shares window and disambiguate it from
sched_time_avg.
A 10ms window appears to be a good compromise between accuracy and performance.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.112173964@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Avoid duplicate shares update calls by ensuring children always appear before
parents in rq->leaf_cfs_rq_list.
This allows us to do a single in-order traversal for update_shares().
Since we always enqueue in bottom-up order this reduces to 2 cases:
1) Our parent is already in the list, e.g.
root
\
b
/\
c d* (root->b->c already enqueued)
Since d's parent is enqueued we push it to the head of the list, implicitly ahead of b.
2) Our parent does not appear in the list (or we have no parent)
In this case we enqueue to the tail of the list, if our parent is subsequently enqueued
(bottom-up) it will appear to our right by the same rule.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234938.022488865@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
the put path since nr_running accounting occurs at deactivation.
It's also not safe to make the removal decision based on load_avg as this fails
with both high periods and low shares. Resolve this by clipping history after
4 periods without activity.
Note: the above will always occur from update_shares() since in the
last-task-sleep-case that task will still be cfs_rq->curr when update_cfs_load
is called.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234937.933428187@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As part of enqueue_entity both a new entity weight and its contribution to the
queuing cfs_rq / rq are updated. Since update_cfs_shares will only update the
queueing weights when the entity is on_rq (which in this case it is not yet),
there's a dependency loop here:
update_cfs_shares needs account_entity_enqueue to update cfs_rq->load.weight
account_entity_enqueue needs the updated weight for the queuing cfs_rq load[*]
Fix this and avoid spurious dequeue/enqueues by issuing update_cfs_shares as
if we had accounted the enqueue already.
This was also resulting in rq->load corruption previously.
[*]: this dependency also exists when using the group cfs_rq w/
update_cfs_shares as the weight of the enqueued entity changes
without the load being updated.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234937.844900206@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make tg_shares_up() use the active cgroup list, this means we cannot
do a strict bottom-up walk of the hierarchy, but assuming its a very
wide tree with a small number of active groups it should be a win.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234937.754159484@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make certain load-balance actions scale per number of active cgroups
instead of the number of existing cgroups.
This makes wakeup/sleep paths more expensive, but is a win for systems
where the vast majority of existing cgroups are idle.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234937.666535048@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
By tracking a per-cpu load-avg for each cfs_rq and folding it into a
global task_group load on each tick we can rework tg_shares_up to be
strictly per-cpu.
This should improve cpu-cgroup performance for smp systems
significantly.
[ Paul: changed to use queueing cfs_rq + bug fixes ]
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101115234937.580480400@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While discussing the need for sched_idle_next(), Oleg remarked that
since try_to_wake_up() ensures sleeping tasks will end up running on a
sane cpu, we can do away with migrate_live_tasks().
If we then extend the existing hack of migrating current from
CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
need for the sched_idle_next() abomination disappears as well, since
idle will be the only possible thread left after the migration thread
stops.
This greatly simplifies the hot-unplug task migration path, as can be
seen from the resulting code reduction (and about half the new lines
are comments).
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1289851597.2109.547.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The compiler warned us about:
kernel/irq_work.c: In function 'irq_work_run':
kernel/irq_work.c:148: warning: value computed is not used
Dropping the cmpxchg() result is indeed weird, but correct -
so annotate away the warning.
Signed-off-by: Sergio Aguirre <saaguirre@ti.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1289930567-17828-1-git-send-email-saaguirre@ti.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Oleg noticed that a perf-fd keeping a reference on the creating task
leads to a few funny side effects.
There's two different aspects to this:
- kernel based perf-events, these should not take out
a reference on the creating task and appear on the task's
event list since they're not bound to fds nor visible
to userspace.
- fork() and pthread_create(), these can lead to the creating
task dying (and thus the task's event-list becomming useless)
but keeping the list and ref alive until the event is closed.
Combined they lead to malfunction of the ptrace hw_tracepoints.
Cure this by not considering kernel based perf_events for the
owner-list and destroying the owner-list when the owner dies.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Oleg Nesterov <oleg@redhat.com>
LKML-Reference: <1289576883.2084.286.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
An earlier commit reverts idle balancing throttling reset to fix a 30%
regression in volanomark throughput. We still need to reset idle_stamp
when we pull a task in newidle balance.
Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1290022924-3548-1-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that we have a new nmi_watchdog that is more generic and
sits on top of the perf subsystem, we really do not need the old
nmi_watchdog any more.
In addition, the old nmi_watchdog doesn't really work if you are
using the default clocksource, hpet. The old nmi_watchdog code
relied on local apic interrupts to determine if the cpu is still
alive. With hpet as the clocksource, these interrupts don't
increment any more and the old nmi_watchdog triggers false
postives.
This piece removes the old nmi_watchdog code and stubs out any
variables and functions calls. The stubs are the same ones used
by the new nmi_watchdog code, so it should be well tested.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: fweisbec@gmail.com
Cc: gorcunov@openvz.org
LKML-Reference: <1289578944-28564-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers. Otherwise, in presence
of CPU real-time threads, the grace period ends, but the callbacks don't
get invoked. If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.
But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When the number of dyanmic kdb commands exceeds KDB_BASE_CMD_MAX, the
kernel will fault.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Call kfree in the error path as well as the success path in kdb_ll().
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Making /proc/kallsyms readable only for root by default makes it
slightly harder for attackers to write generic kernel exploits by
removing one source of knowledge where things are in the kernel.
This is the second submit, discussion happened on this on first submit
and mostly concerned that this is just one hole of the sieve ... but
one of the bigger ones.
Changing the permissions of at least System.map and vmlinux is also
required to fix the same set, but a packaging issue.
Target of this starter patch and follow ups is removing any kind of
kernel space address information leak from the kernel.
[ Side note: the default of root-only reading is the "safe" value, and
it's easy enough to then override at any time after boot. The /proc
filesystem allows root to change the permissions with a regular
chmod, so you can "revert" this at run-time by simply doing
chmod og+r /proc/kallsyms
as root if you really want regular users to see the kernel symbols.
It does help some tools like "perf" figure them out without any
setup, so it may well make sense in some situations. - Linus ]
Signed-off-by: Marcus Meissner <meissner@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Eugene Teo <eugeneteo@kernel.org>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix cross-sched-class wakeup preemption
sched: Fix runnable condition for stoptask
sched: Use group weight, idle cpu metrics to fix imbalances during idle
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM / PM QoS: Fix reversed min and max
PM / OPP: Hide OPP configuration when SoCs do not provide an implementation
PM: Allow devices to be removed during late suspend and early resume
Move it out of printk.c so that we can use it all over the code. There
are some potential users which will be converted to that macro in next
patches.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] kprobes: Fix the return address of multiple kretprobes
[S390] kprobes: disable interrupts throughout
[S390] ftrace: build without frame pointers on s390
[S390] mm: add devmem_is_allowed() for STRICT_DEVMEM checking
[S390] vmlogrdr: purge after recording is switched off
[S390] cio: fix incorrect ccw_device_init_count
[S390] tape: add medium state notifications
[S390] fix get_user_pages_fast
Sigh...
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While at it, fix two checkpatch errors.
Several non-const struct instances constified by this patch were added after
the introduction of platform_suspend_ops in checkpatch.pl's list of "should
be const" structs (79404849e9).
Patch against mainline.
Inspired by hunks of the grsecurity patch, updated for newer kernels.
Signed-off-by: Lionel Debroux <lionel_debroux@yahoo.fr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Patch against mainline.
Changes since v1: added one hunk; no longer adding "const" qualifier to
pointers in platform_hibernation_ops after seeing
b4144e4f6e.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The addition of CONFIG_SECURITY_DMESG_RESTRICT resulted in a build
failure when CONFIG_PRINTK=n. This is because the capabilities code
which used the new option was built even though the variable in question
didn't exist.
The patch here fixes this by moving the capabilities checks out of the
LSM and into the caller. All (known) LSMs should have been calling the
capabilities hook already so it actually makes the code organization
better to eliminate the hook altogether.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pm_qos_get_value had min and max reversed, causing all pm_qos
requests to have no effect.
Signed-off-by: Colin Cross <ccross@android.com>
Acked-by: mark <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@kernel.org
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
The user stack trace can fault when examining the trace. Which
would call the do_page_fault handler, which would trace again,
which would do the user stack trace, which would fault and call
do_page_fault again ...
Thus this is causing a recursive bug. We need to have a recursion
detector here.
[ Resubmitted by Jiri Olsa ]
[ Eric Dumazet recommended using __this_cpu_* instead of __get_cpu_* ]
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1289390172-9730-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: remove unused copy_io_context()
Documentation: remove anticipatory scheduler info
block: remove REQ_HARDBARRIER
ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2)
ioprio: fix RCU locking around task dereference
block: ioctl: fix information leak to userland
block: read i_size with i_size_read()
cciss: fix proc warning on attempt to remove non-existant directory
bio: take care not overflow page count when mapping/copying user data
block: limit vec count in bio_kmalloc() and bio_alloc_map_data()
block: take care not to overflow when calculating total iov length
block: check for proper length of iov entries in blk_rq_map_user_iov()
cciss: remove controllers supported by hpsa
cciss: use usleep_range not msleep for small sleeps
cciss: limit commands allocated on reset_devices
cciss: Use kernel provided PCI state save and restore functions
cciss: fix board status waiting code
drbd: Removed checks for REQ_HARDBARRIER on incomming BIOs
drbd: REQ_HARDBARRIER -> REQ_FUA transition for meta data accesses
drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code
...
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, amd: Use kmalloc_node(,__GFP_ZERO) for northbridge structure allocation
perf_events: Fix time tracking in samples
perf trace: update usage
perf trace: update Documentation with new perf trace variants
perf trace: live-mode command-line cleanup
perf trace record: handle commands correctly
perf record: make the record options available outside perf record
perf trace scripting: remove system-wide param from shell scripts
perf trace scripting: fix some small memory leaks and missing error checks
perf: Fix usages of profile_cpu in builtin-top.c to use cpu_list
perf, ui: Eliminate stack-smashing protection compiler complaint
The kernel syslog contains debugging information that is often useful
during exploitation of other vulnerabilities, such as kernel heap
addresses. Rather than futilely attempt to sanitize hundreds (or
thousands) of printk statements and simultaneously cripple useful
debugging functionality, it is far simpler to create an option that
prevents unprivileged users from reading the syslog.
This patch, loosely based on grsecurity's GRKERNSEC_DMESG, creates the
dmesg_restrict sysctl. When set to "0", the default, no restrictions are
enforced. When set to "1", only users with CAP_SYS_ADMIN can read the
kernel syslog via dmesg(8) or other mechanisms.
[akpm@linux-foundation.org: explain the config option in kernel.txt]
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Eugene Teo <eugeneteo@kernel.org>
Acked-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per task latencytop accumulator prematurely terminates due to erroneous
placement of latency_record_count. It should be incremented whenever a
new record is allocated instead of increment on every latencytop event.
Also fix search iterator to only search known record events instead of
blindly searching all pre-allocated space.
Signed-off-by: Ken Chen <kenchen@google.com>
Reviewed-by: Arjan van de Ven <arjan@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clean_sort_range() should return a number of nonempty elements of range
array, but if the array is full clean_sort_range() returns 0.
The problem is that the number of nonempty elements is evaluated by
finding the first empty element of the array. If there is no such element
it returns an initial value of local variable nr_range that is zero.
The fix is trivial: it changes initial value of nr_range to size of the
array.
The bug can lead to loss of information regarding all ranges, since
typically returned value of clean_sort_range() is considered as an actual
number of ranges in the array after a series of add/subtract operations.
Found by Analytical Verification project of Linux Verification Center
(linuxtesting.org), thanks to Alexander Kolosov.
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using early debugging, the kernel does not initialize the
hw_breakpoint API early enough and causes the late initialization of
the kernel debugger to fail. The boot arguments are:
earlyprintk=vga ekgdboc=kbd kgdbwait
Then simply type "go" at the kdb prompt and boot. The kernel will
later emit the message:
kgdb: Could not allocate hwbreakpoints
And at that point the kernel debugger will cease to work correctly.
The solution is to initialize the hw_breakpoint at the same time that
all the other perf call backs are initialized instead of using a
core_initcall() initialization which happens well after the kernel
debugger can make use of hardware breakpoints.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4CD3396D.1090308@windriver.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Instead of dealing with sched classes inside each check_preempt_curr()
implementation, pull out this logic into the generic wakeup preemption
path.
This fixes a hang in KVM (and others) where we are waiting for the
stop machine thread to run ...
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1288891946.2039.31.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On use of trace_printk() there's a macro that determines if the format
is static or a variable. If it is static, it defaults to __trace_bprintk()
otherwise it uses __trace_printk().
A while ago, Lai Jiangshan added __trace_bprintk(). In that patch, we
discussed a way to allow modules to use it. The difference between
__trace_bprintk() and __trace_printk() is that for faster processing,
just the format and args are stored in the trace instead of running
it through a sprintf function. In order to do this, the format used
by the __trace_bprintk() had to be persistent.
See commit 1ba28e02a1
The problem comes with trace_bprintk() where the module is unloaded.
The pointer left in the buffer is still pointing to the format.
To solve this issue, the formats in the module were copied into kernel
core. If the same format was used, they would use the same copy (to prevent
memory leak). This all worked well until we tried to merge everything.
At the time this was written, Lai Jiangshan, Frederic Weisbecker,
Ingo Molnar and myself were all touching the same code. When this was
merged, we lost the part of it that was in module.c. This kept out the
copying of the formats and unloading the module could cause bad pointers
left in the ring buffer.
This patch adds back (with updates required for current kernel) the
module code that sets up the necessary pointers.
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the OPP API is only useful with an appropraite SoC-specific
implementation there is no point in offering the ability to enable
the API on general systems. Provide an ARCH_HAS OPP Kconfig symbol
which masks out the option unless selected by an implementation.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Acked-by: Nishanth Menon <nm@ti.com>
Acked-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Heiko reported that the TASK_RUNNING check is not sufficient for
CONFIG_PREEMPT=y since we can get preempted with !TASK_RUNNING.
He suggested adding a ->se.on_rq test to the existing TASK_RUNNING
one, however TASK_RUNNING will always have ->se.on_rq, so we might as
well reduce that to a single test.
[ stop tasks should never get preempted, but its good to handle
this case correctly should this ever happen ]
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently we consider a sched domain to be well balanced when the imbalance
is less than the domain's imablance_pct. As the number of cores and threads
are increasing, current values of imbalance_pct (for example 25% for a
NUMA domain) are not enough to detect imbalances like:
a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
socket. Leading to an idle HT cpu.
b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and
16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
socket and 7 on another socket. Leaving one core in a socket idle
whereas in another socket we have a core having both its HT siblings busy.
While this issue can be fixed by decreasing the domain's imbalance_pct
(by making it a function of number of logical cpus in the domain), it
can potentially cause more task migrations across sched groups in an
overloaded case.
Fix this by using imbalance_pct only during newly_idle and busy
load balancing. And during idle load balancing, check if there
is an imbalance in number of idle cpu's across the busiest and this
sched_group or if the busiest group has more tasks than its weight that
the idle cpu in this_group can pull.
Reported-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch corrects time tracking in samples. Without this patch
both time_enabled and time_running are bogus when user asks for
PERF_SAMPLE_READ.
One uses PERF_SAMPLE_READ to sample the values of other counters
in each sample. Because of multiplexing, it is necessary to know
both time_enabled, time_running to be able to scale counts correctly.
In this second version of the patch, we maintain a shadow
copy of ctx->time which allows us to compute ctx->time without
calling update_context_time() from NMI context. We avoid the
issue that update_context_time() must always be called with
ctx->lock held.
We do not keep shadow copies of the other event timings
because if the lead event is overflowing then it is active
and thus it's been scheduled in via event_sched_in() in
which case neither tstamp_stopped, tstamp_running can be modified.
This timing logic only applies to samples when PERF_SAMPLE_READ
is used.
Note that this patch does not address timing issues related
to sampling inheritance between tasks. This will be addressed
in a future patch.
With this patch, the libpfm4 example task_smpl now reports
correct counts (shown on 2.4GHz Core 2):
$ task_smpl -p 2400000000 -e unhalted_core_cycles:u,instructions_retired:u,baclears noploop 5
noploop for 5 seconds
IIP:0x000000004006d6 PID:5596 TID:5596 TIME:466,210,211,430 STREAM_ID:33 PERIOD:2,400,000,000 ENA=1,010,157,814 RUN=1,010,157,814 NR=3
2,400,000,254 unhalted_core_cycles:u (33)
2,399,273,744 instructions_retired:u (34)
53,340 baclears (35)
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4cc6e14b.1e07e30a.256e.5190@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The futex_q struct has grown considerably over the last couple years. I
believe it now merits a static initializer to avoid uninitialized data
errors (having spent more time than I care to admit debugging an uninitialized
q.bitset in an experimental new op code).
With the key initializer built in, several of the FUTEX_KEY_INIT calls can
be removed.
V2: use a static variable instead of an init macro.
use a C99 initializer and don't rely on variable ordering in the struct.
V3: make futex_q_init const
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1289252428-18383-1-git-send-email-dvhart@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In the early days we passed the mmap sem around. That became the
"int fshared" with the fast gup improvements. Then we added
"int clockrt" in places. This patch unifies these options as "flags".
[ tglx: Split out the stale fshared cleanup ]
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1289250609-16304-1-git-send-email-dvhart@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The fast GUP changes stopped using the fshared flag in
put_futex_keys(), but we kept the interface the same.
Cleanup all stale users.
This patch is split out from Darren Harts combo patch which also
combines various flags. This way the changes are clearly separated.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhart@linux.intel.com>
LKML-Reference: <1289250609-16304-1-git-send-email-dvhart@linux.intel.com>
REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
at this point is:
- various checks inside the block layer.
- sanity checks in bio based drivers.
- now unused bio_empty_barrier helper.
- Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
but Xen really needs to sort out it's barrier situaton.
- setting of ordered tags in uas - dead code copied from old scsi
drivers.
- scsi different retry for barriers - it's dead and should have been
removed when flushes were converted to FS requests.
- blktrace handling of barriers - removed. Someone who knows blktrace
better should add support for REQ_FLUSH and REQ_FUA, though.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Since commit 1dcc41bb (futex: Change 3rd arg of fetch_robust_entry()
to unsigned int*) some gcc versions decided to emit the following
warning:
kernel/futex.c: In function ‘exit_robust_list’:
kernel/futex.c:2492: warning: ‘next_pi’ may be used uninitialized in this function
The commit did not introduce the warning as gcc should have warned
before that commit as well. It's just gcc being silly.
The code path really can't result in next_pi being unitialized (or
should not), but let's keep the build clean. Annotate next_pi as an
uninitialized_var.
[ tglx: Addressed the same issue in futex_compat.c and massaged the
changelog ]
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Tested-by: Matt Fleming <matt@console-pimps.org>
Tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1288897200-13008-1-git-send-email-dvhart@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 4221a9918e "Add RCU check for
find_task_by_vpid()" introduced rcu_lockdep_assert to find_task_by_pid_ns.
Add rcu_read_lock/rcu_read_unlock to call find_task_by_vpid.
Tetsuo Handa wrote:
| Quoting from one of posts in that thead
| http://kerneltrap.org/mailarchive/linux-kernel/2010/2/8/4536388
|
|| Usually tasklist gives enough protection, but if copy_process() fails
|| it calls free_pid() lockless and does call_rcu(delayed_put_pid().
|| This means, without rcu lock find_pid_ns() can't scan the hash table
|| safely.
Thomas Gleixner wrote:
| We can remove the tasklist_lock while at it. rcu_read_lock is enough.
Patch also replaces thread_group_leader with has_group_leader_pid
in accordance to comment by Oleg Nesterov:
| ... thread_group_leader() check is not relaible without
| tasklist. If we race with de_thread() find_task_by_vpid() can find
| the new leader before it updates its ->group_leader.
|
| perhaps it makes sense to change posix_cpu_timer_create() to use
| has_group_leader_pid() instead, just to make this code not look racy
| and avoid adding new problems.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
LKML-Reference: <20101103165256.GD30053@swordfish.minsk.epam.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
s390 doesn't need FRAME_POINTERS in order to have a working function tracer.
We don't need frame pointers in order to get strack traces since we always
have valid backchains by using the -mkernel-backchain gcc option.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Commit d9ca07a05c ("watchdog: Avoid kernel crash when disabling
watchdog") introduces a section mismatch.
Now that we reference no_watchdog from non-__init code it can no longer
be __initdata.
Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
posix-cpu-timers.c correctly assumes that the dying process does
posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
timers from signal->cpu_timers list.
But, it also assumes that timer->it.cpu.task is always the group
leader, and thus the dead ->task means the dead thread group.
This is obviously not true after de_thread() changes the leader.
After that almost every posix_cpu_timer_ method has problems.
It is not simple to fix this bug correctly. First of all, I think
that timer->it.cpu should use struct pid instead of task_struct.
Also, the locking should be reworked completely. In particular,
tasklist_lock should not be used at all. This all needs a lot of
nontrivial and hard-to-test changes.
Change __exit_signal() to do posix_cpu_timers_exit_group() when
the old leader dies during exec. This is not the fix, just the
temporary hack to hide the problem for 2.6.37 and stable. IOW,
this is obviously wrong but this is what we currently have anyway:
cpu timers do not work after mt exec.
In theory this change adds another race. The exiting leader can
detach the timers which were attached to the new leader. However,
the window between de_thread() and release_task() is small, we
can pretend that sys_timer_create() was called before de_thread().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can optimize kernel/relay.c::relay_alloc_page_array() slightly by
using vzalloc. The patch makes these changes:
- use vzalloc instead of vmalloc+memset.
- remove redundant local variable 'array'.
- declare local 'pa_size' as const.
Cuts down nicely on both source and object-code size.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes a typo in the error message raised by audit when auditd has died.
Signed-off-by: Ross Kirk <ross.kirk@nexor.com>
--
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Fix up irq_node() for irq_data changes.
genirq: Add single IRQ reservation helper
genirq: Warn if enable_irq is called before irq is set up
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
semaphore: Remove mutex emulation
staging: Final semaphore cleanup
jbd2: Convert jbd2_slab_create_sem to mutex
hpfs: Convert sbi->hpfs_creation_de to mutex
Fix up trivial change/delete conflicts with deleted 'dream' drivers
(drivers/staging/dream/camera/{mt9d112.c,mt9p012_fox.c,mt9t013.c,s5k3e2fx.c})
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
jump label: Add work around to i386 gcc asm goto bug
x86, ftrace: Use safe noops, drop trap test
jump_label: Fix unaligned traps on sparc.
jump label: Make arch_jump_label_text_poke_early() optional
jump label: Fix error with preempt disable holding mutex
oprofile: Remove deprecated use of flush_scheduled_work()
oprofile: Fix the hang while taking the cpu offline
jump label: Fix deadlock b/w jump_label_mutex vs. text_mutex
jump label: Fix module __init section race
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Check irq_remapped instead of remapping_enabled in destroy_irq()
Normal syscall audit doesn't catch 5th argument of syscall. It also
doesn't catch the contents of userland structures pointed to be
syscall argument, so for both old and new mmap(2) ABI it doesn't
record the descriptor we are mapping. For old one it also misses
flags.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Protect the task lookups in audit_receive_msg() with rcu_read_lock()
instead of tasklist_lock and use lock/unlock_sighand to protect
against the exit race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
audit_receive_msg() sends uninitialized data for AUDIT_TTY_GET when
the task was not found.
Send reply only when task was found.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
While auditing all tasklist_lock read_lock sites I stumbled over the
following call chain:
audit_prepare_user_tty()
read_lock(&tasklist_lock);
tty_audit_push_task();
mutex_lock(&buf->mutex);
--> buf->mutex is locked with preemption disabled.
Solve this by acquiring a reference to the task struct under
rcu_read_lock and call tty_audit_push_task outside of the preempt
disabled region.
Move all code which needs to be protected by sighand lock into
tty_audit_push_task() and use lock/unlock_sighand as we do not hold
tasklist_lock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I was doing some namespace checks and found some simple stuff in
audit that could be cleaned up. Make some functions static, and
put const on make_reply payload arg.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add support for matching by security label (e.g. SELinux context) of
the sender of an user-space audit record.
The audit filter code already allows user space to configure such
filters, but they were ignored during evaluation. This patch implements
evaluation of these filters.
For example, after application of this patch, PAM authentication logs
caused by cron can be disabled using
auditctl -a user,never -F subj_type=crond_t
Signed-off-by: Miloslav Trmac <mitr@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The kgdb_disable_hw_debug() was an architecture specific function for
disabling all hardware breakpoints on a per cpu basis when entering
the debug core.
This patch will remove the weak function kdbg_disable_hw_debug() and
change it into a call back which lives with the rest of hw breakpoint
call backs in struct kgdb_arch.
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The kdb_current legally be equal to NULL in the early boot of the x86
arch. The problem pcan be observed by booting with the kernel arguments:
earlyprintk=vga ekgdboc=kbd kgdbwait
The kdb shell will oops on entry and recursively fault because it
cannot get past the final stage of shell initialization.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Rusty pointed out that the per_cpu command uses up lots of space on
the stack and the cpu supress mask is probably not needed.
This patch removes the need for the supress mask as well as fixing up
the following problems with the kdb per_cpu command:
* The per_cpu command should allow an address as an argument
* When you have more data than can be displayed on one screen allow
the user to break out of the print loop.
Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Some archs do not need to do anything special for jump labels on
startup (like MIPS). This patch adds a weak function stub for
arch_jump_label_text_poke_early();
Cc: Jason Baron <jbaron@redhat.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Daney <ddaney@caviumnetworks.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <1286218615-24011-2-git-send-email-ddaney@caviumnetworks.com>
LKML-Reference: <20101015201037.703989993@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Kprobes and jump label were having a race between mutexes that
was fixed by reordering the jump label. But this reordering
moved the jump label mutex into a preempt disable location.
This patch does a little fiddling to move the grabbing of
the jump label mutex from inside the preempt disable section
and still keep the order correct between the mutex and the
kprobes lock.
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (27 commits)
x86: allocate space within a region top-down
x86: update iomem_resource end based on CPU physical address capabilities
x86/PCI: allocate space from the end of a region, not the beginning
PCI: allocate bus resources from the top down
resources: support allocating space within a region from the top down
resources: handle overflow when aligning start of available area
resources: ensure callback doesn't allocate outside available space
resources: factor out resource_clip() to simplify find_resource()
resources: add a default alignf to simplify find_resource()
x86/PCI: MMCONFIG: fix region end calculation
PCI: Add support for polling PME state on suspended legacy PCI devices
PCI: Export some PCI PM functionality
PCI: fix message typo
PCI: log vendor/device ID always
PCI: update Intel chipset names and defines
PCI: use new ccflags variable in Makefile
PCI: add PCI_MSIX_TABLE/PBA defines
PCI: add PCI vendor id for STmicroelectronics
x86/PCI: irq and pci_ids patch for Intel Patsburg DeviceIDs
PCI: OLPC: Only enable PCI configuration type override on XO-1
...
register_kprobe() downs the 'text_mutex' and then calls
jump_label_text_reserved(), which downs the 'jump_label_mutex'.
However, the jump label code takes those mutexes in the reverse
order.
Fix by requiring the caller of jump_label_text_reserved() to do
the jump label locking via the newly added: jump_label_lock(),
jump_label_unlock(). Currently, kprobes is the only user
of jump_label_text_reserved().
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <759032c48d5e30c27f0bba003d09bffa8e9f28bb.1285965957.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Jump label uses is_module_text_address() to ensure that the module
__init sections are valid before updating them. However, between the
check for a valid module __init section and the subsequent jump
label update, the module's __init section could be freed out from under
us.
We fix this potential race by adding a notifier callback to the
MODULE_STATE_LIVE state. This notifier is called *after* the __init
section has been run but before it is going to be freed. In the
callback, the jump label code zeros the key value for any __init jump
code within the module, and we add a check for a non-zero key value when
we update jump labels. In this way we require no additional data
structures.
Thanks to Mathieu Desnoyers for pointing out this race condition.
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <c6f037b7598777668025ceedd9294212fd95fa34.1285965957.git.jbaron@redhat.com>
[ Renamed remove_module_init() to remove_jump_label_module_init()
as suggested by Masami Hiramatsu. ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300: (44 commits)
MN10300: Save frame pointer in thread_info struct rather than global var
MN10300: Change "Matsushita" to "Panasonic".
MN10300: Create a defconfig for the ASB2364 board
MN10300: Update the ASB2303 defconfig
MN10300: ASB2364: Add support for SMSC911X and SMC911X
MN10300: ASB2364: Handle the IRQ multiplexer in the FPGA
MN10300: Generic time support
MN10300: Specify an ELF HWCAP flag for MN10300 Atomic Operations Unit support
MN10300: Map userspace atomic op regs as a vmalloc page
MN10300: And Panasonic AM34 subarch and implement SMP
MN10300: Delete idle_timestamp from irq_cpustat_t
MN10300: Make various interrupt priority settings configurable
MN10300: Optimise do_csum()
MN10300: Implement atomic ops using atomic ops unit
MN10300: Make the FPU operate in non-lazy mode under SMP
MN10300: SMP TLB flushing
MN10300: Use the [ID]PTEL2 registers rather than [ID]PTEL for TLB control
MN10300: Make the use of PIDR to mark TLB entries controllable
MN10300: Rename __flush_tlb*() to local_flush_tlb*()
MN10300: AM34 erratum requires MMUCTR read and write on exception entry
...
* 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
NULL-terminate all pci_device_id tables
(trivial) Fix compiler warning in kernel/modules.c
If the same resource is inserted to the resource tree (maybe not on
purpose), a dead loop will be created. In this situation, The kernel does
not report any warning or error :(
The command below will show a endless print.
#cat /proc/iomem
[akpm@linux-foundation.org: add WARN_ON()]
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The taskstats interface uses microsecond granularity for the user and
system time values. The conversion from cputime to the taskstats values
uses the cputime_to_msecs primitive which effectively limits the
granularity to milliseconds. Add the cputime_to_usecs primitive for
architectures that have better, more precise CPU time values. Remove
cputime_to_msecs primitive because there are no more users left.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Luck Tony <tony.luck@intel.com>
Cc: Shailabh Nagar <nagar1234@in.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Separate the finding of a task_struct by pid or tgid from filling the
taskstats data. This makes the code more readable.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move each taskstats command into a single function. This makes the code
more readable and makes it easier to add new commands.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
prepare_reply() sets up an skb for the response. The payload contains:
+--------------------------------+
| genlmsghdr - 4 bytes |
+--------------------------------+
| NLA header - 4 bytes | /* Aggregate header */
+-+------------------------------+
| | NLA header - 4 bytes | /* PID header */
| +------------------------------+
| | pid/tgid - 4 bytes |
| +------------------------------+
| | NLA header - 4 bytes | /* stats header */
| + -----------------------------+ <- oops. aligned on 4 byte boundary
| | struct taskstats - 328 bytes |
+-+------------------------------+
The start of the taskstats struct must be 8 byte aligned on IA64 (and
other systems with 8 byte alignment rules for 64-bit types) or runtime
alignment warnings will be issued.
This patch pads the pid/tgid field out to sizeof(long), which forces the
alignment of taskstats. The getdelays userspace code is ok with this
since it assumes 32-bit pid/tgid and then honors that header's length
field.
An array is used to avoid exposing kernel memory contents to userspace in
the response.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In /proc/stat, the number of per-IRQ event is shown by making a sum each
irq's events on all cpus. But we can make use of kstat_irqs().
kstat_irqs() do the same calculation, If !CONFIG_GENERIC_HARDIRQ,
it's not a big cost. (Both of the number of cpus and irqs are small.)
If a system is very big and CONFIG_GENERIC_HARDIRQ, it does
for_each_irq()
for_each_cpu()
- look up a radix tree
- read desc->irq_stat[cpu]
This seems not efficient. This patch adds kstat_irqs() for
CONFIG_GENRIC_HARDIRQ and change the calculation as
for_each_irq()
look up radix tree
for_each_cpu()
- read desc->irq_stat[cpu]
This reduces cost.
A test on (4096cpusp, 256 nodes, 4592 irqs) host (by Jack Steiner)
%time cat /proc/stat > /dev/null
Before Patch: 2.459 sec
After Patch : .561 sec
[akpm@linux-foundation.org: unexport kstat_irqs, coding-style tweaks]
[akpm@linux-foundation.org: fix unused variable 'per_irq_sum']
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Jack Steiner <steiner@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() releases and regrabs tasklist_lock but was missing
proper annotations. Add it. This remove following sparse warning:
warning: context imbalance in 'find_new_reaper' - unexpected unlock
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
execve() has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lock_task_sighand() grabs sighand->siglock in case of returning non-NULL
but unlock_task_sighand() releases it unconditionally. This leads sparse
to complain about the lock context imbalance. Rename and wrap
lock_task_sighand() using __cond_lock() macro to make sparse happy.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use new 'datavp' and 'datalp' variables to remove unnecesary castings.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since userspace API of ptrace syscall defines @addr and @data as void
pointers, it would be more appropriate to define them as unsigned long in
kernel. Therefore related functions are changed also.
'unsigned long' is typically used in other places in kernel as an opaque
data type and that using this helps cleaning up a lot of warnings from
sparse.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ns_cgroup will be removed very soon. Let's warn, for this version,
ns_cgroup is deprecated.
Make ns_cgroup and clone_children exclusive. If the clone_children is set
and the ns_cgroup is mounted, let's fail with EINVAL when the ns_cgroup
subsys is created (a printk will help the user to understand why the
creation fails).
Update the feature remove schedule file with the deprecated ns_cgroup.
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function "strcpy" is used without check for maximum allowed source string
length and could cause destination string overflow. Check for string
length is added before using "strcpy". Function now is return error if
source string length is more than a maximum.
akpm: presently considered NotABug, but add the check for general
future-safeness and robustness.
Signed-off-by: Evgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current behavior:
=================
(1) When we mount a cgroup, we can specify the 'all' option which
means to enable all the cgroup subsystems. This is the default option
when no option is specified.
(2) If we want to mount a cgroup with a subset of the supported cgroup
subsystems, we have to specify a subsystems name list for the mount
option.
(3) If we specify another option like 'noprefix' or 'release_agent',
the actual code wants the 'all' or a subsystem name option specified
also. Not critical but a bit not friendly as we should assume (1) in
this case.
(4) Logically, the 'all' option is mutually exclusive with a subsystem
name, but this is not detected.
In other words:
succeed : mount -t cgroup -o all,freezer cgroup /cgroup
=> is it 'all' or 'freezer' ?
fails : mount -t cgroup -o noprefix cgroup /cgroup
=> succeed if we do '-o noprefix,all'
The following patches consolidate a bit the mount options check.
New behavior:
=============
(1) untouched
(2) untouched
(3) the 'all' option will be by default when specifying other than
a subsystem name option
(4) raises an error
In other words:
fails : mount -t cgroup -o all,freezer cgroup /cgroup
succeed : mount -t cgroup -o noprefix cgroup /cgroup
For the sake of lisibility, the if ... then ... else ... if ...
indentation when parsing the options has been changed to:
if ... then
...
continue
fi
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ns_cgroup is a control group interacting with the namespaces. When a
new namespace is created, a corresponding cgroup is automatically created
too. The cgroup name is the pid of the process who did 'unshare' or the
child of 'clone'.
This cgroup is tied with the namespace because it prevents a process to
escape the control group and use the post_clone callback, so the child
cgroup inherits the values of the parent cgroup.
Unfortunately, the more we use this cgroup and the more we are facing
problems with it:
(1) when a process unshares, the cgroup name may conflict with a
previous cgroup with the same pid, so unshare or clone return -EEXIST
(2) the cgroup creation is out of control because there may have an
application creating several namespaces where the system will
automatically create several cgroups in his back and let them on the
cgroupfs (eg. a vrf based on the network namespace).
(3) the mix of (1) and (2) force an administrator to regularly check
and clean these cgroups.
This patchset removes the ns_cgroup by adding a new flag to the cgroup and
the cgroupfs mount option. It enables the copy of the parent cgroup when
a child cgroup is created. We can then safely remove the ns_cgroup as
this flag brings a compatibility. We have now to manually create and add
the task to a cgroup, which is consistent with the cgroup framework.
This patch:
Sent as an answer to a previous thread around the ns_cgroup.
https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html
It adds a control file 'clone_children' for a cgroup. This control file
is a boolean specifying if the child cgroup should be a clone of the
parent cgroup or not. The default value is 'false'.
This flag makes the child cgroup to call the post_clone callback of all
the subsystem, if it is available.
At present, the cpuset is the only one which had implemented the
post_clone callback.
The option can be set at mount time by specifying the 'clone_children'
mount option.
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Cc: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are 4 state transitions possible for a freezer. Only FREEZING ->
FROZEN transaction is done lazily. This patch allows update_freezer_state
only to perform this transaction and renames the function to
update_if_frozen.
Moreover is_task_frozen_enough function is removed and its every occurence
is replaced with frozen(). Therefore for a group to become FROZEN every
task must be frozen.
The previous version could trigger a following bug: When cgroup is in the
process of freezing (but none of its tasks are frozen yet),
update_freezer_state() (called from freezer_read or freezer_write) would
incorrectly report that a group is 'THAWED' (because nfrozen = 0),
allowing the transaction FREEZING -> THAWED without writing anything to
'freezer.state'. This is incorrect according to the documentation. This
could result in a 'THAWED' cgroup with frozen tasks inside.
A code to reproduce this bug is available here:
http://pentium.hopto.org/~thinred/repos/linux-misc/freezer_bug2.c
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Tomasz Buchert <tomasz.buchert@inria.fr>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible to move a task from its cgroup even if this group is
'FREEZING'. This results in a nasty bug - the moved task will become
frozen OUTSIDE its original cgroup and will remain in a permanent 'D'
state.
This patch allows to migrate the task only between THAWED cgroups.
This behavior was observed and easily reproduced on a single core laptop.
Notice that reproducibility depends highly on the machine used. Program
and instructions how to reproduce the bug can be fetched from:
http://pentium.hopto.org/~thinred/repos/linux-misc/freezer_bug.c
Signed-off-by: Tomasz Buchert <tomasz.buchert@inria.fr>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The root freezer_state is always CGROUP_THAWED so we can remove the
special case from the code. The test itself can be handy and is extracted
to static function.
Signed-off-by: Tomasz Buchert <tomasz.buchert@inria.fr>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Typedef the pointer to the function to be called by smp_call_function() and
friends:
typedef void (*smp_call_func_t)(void *info);
as it is used in a fair number of places.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-arch@vger.kernel.org
Building with CONFIG_KALLSYMS=n gives following warning:
/mnt/src/linux-git/kernel/module.c: In function ‘post_relocation’:
/mnt/src/linux-git/kernel/module.c:2534:2: warning: passing argument 2 of ‘add_kallsyms’ discards qualifiers from pointer target type
/mnt/src/linux-git/kernel/module.c:2038:13: note: expected ‘struct load_info *’ but argument is of type ‘const struct load_info *’
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
Add more wait, wake, and completion interfaces to the device-drivers
docbook.
Fix kernel-doc notation in the added files.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding declaration of printk_ratelimit_state in ratelimit.h removes
potential build breakage and following sparse warning:
kernel/printk.c:1426:1: warning: symbol 'printk_ratelimit_state' was not declared. Should it be static?
[akpm@linux-foundation.org: remove unneeded ifdef]
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_option() takes its 2nd arg as int * so passing boot_delay to it
caused following warnings from sparse:
kernel/printk.c:223:27: warning: incorrect type in argument 2 (different signedness)
kernel/printk.c:223:27: expected int *pint
kernel/printk.c:223:27: got unsigned int static [toplevel] *<noident>
Since boot_delay can't grow more than 10,000 changing it to 'int *'
will not produce any problem.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e6bde73b07 ("cpu-hotplug: return
better errno on cpu hotplug failure"), the cpu notifier can return an
encapsulated errno value.
This converts the cpu notifier to return an encapsulated errno value for
stop_machine().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/stop_machine.c: In function `cpu_stopper_thread':
kernel/stop_machine.c:265: warning: unused variable `ksym_buf'
ksym_buf[] is unused if WARN_ON() is a no-op.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :
<quote>
We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:
atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;
The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;
In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.
</quote>
Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.
get_max_files() is changed to return an unsigned long. get_nr_files() is
changed to return a long.
unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.
Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968
After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Silly though it is, completions and wait_queue_heads use foo_ONSTACK
(COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK,
__WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK) so I
guess workqueues should do the same thing.
s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/
s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After all that's what they are intended for.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ensure kmap_atomic() usage is strictly nested
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's pointless to kill a task if another thread sharing its mm cannot be
killed to allow future memory freeing. A subsequent patch will prevent
kills in such cases, but first it's necessary to have a way to flag a task
that shares memory with an OOM_DISABLE task that doesn't incur an
additional tasklist scan, which would make select_bad_process() an O(n^2)
function.
This patch adds an atomic counter to struct mm_struct that follows how
many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
They cannot be killed by the kernel, so their memory cannot be freed in
oom conditions.
This only requires task_lock() on the task that we're operating on, it
does not require mm->mmap_sem since task_lock() pins the mm and the
operation is atomic.
[rientjes@google.com: changelog and sys_unshare() code]
[rientjes@google.com: protect oom_disable_count with task_lock in fork]
[rientjes@google.com: use old_mm for oom_disable_count in exec]
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocate space from the top of a region first, then work downward,
if an architecture desires this.
When we allocate space from a resource, we look for gaps between children
of the resource. Previously, we always looked at gaps from the bottom up.
For example, given this:
[mem 0xbff00000-0xf7ffffff] PCI Bus 0000:00
[mem 0xbff00000-0xbfffffff] gap -- available
[mem 0xc0000000-0xdfffffff] PCI Bus 0000:02
[mem 0xe0000000-0xf7ffffff] gap -- available
we attempted to allocate from the [mem 0xbff00000-0xbfffffff] gap first,
then the [mem 0xe0000000-0xf7ffffff] gap.
With this patch an architecture can choose to allocate from the top gap
[mem 0xe0000000-0xf7ffffff] first.
We can't do this across the board because iomem_resource.end is initialized
to 0xffffffff_ffffffff on 64-bit architectures, and most machines can't
address the entire 64-bit physical address space. Therefore, we only
allocate top-down if the arch requests it by clearing
"resource_alloc_from_bottom".
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
If tmp.start is near ~0, ALIGN(tmp.start) may overflow, which would
make us think there's more available space than there really is. We
would likely return something that conflicts with a previous resource,
which would cause a failure when allocate_resource() requests the newly-
allocated region.
Reference: https://bugzilla.redhat.com/show_bug.cgi?id=646027
Reported-by: Fabrice Bellet <fabrice@bellet.info>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The alignment callback returns a proposed location, which may have been
adjusted to avoid ISA aliases or for other architecture-specific reasons.
We already had a check ("tmp.start < tmp.end") to make sure the callback
doesn't return an area that extends past the available area. This patch
reworks the check to make sure it doesn't return an area that extends
either below or above the available area.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This factors out the min/max clipping to simplify find_resource().
No functional change.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This removes a test from find_resource(), which is getting cluttered.
No functional change.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.
Based on an earlier patch from Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
Based on a patch originally from Eric Dumazet.
[AV: cleaned up a bit, fixed build breakage on weird configs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Andrew,
Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.
Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())
Thanks !
[PATCH V4] fs: allow for more than 2^31 files
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :
<quote>
We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:
atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;
The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;
In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.
</quote>
Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.
get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.
unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.
Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968
After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the MN10300 arch, we occasionally see an assertion being tripped in
alloc_cwqs() at the following line:
/* just in case, make sure it's actually aligned */
---> BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
return wq->cpu_wq.v ? 0 : -ENOMEM;
The values are:
wa->cpu_wq.v => 0x902776e0
align => 0x100
and align is calculated by the following:
const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS,
__alignof__(unsigned long long));
This is because the pointer in question (wq->cpu_wq.v) loses some of its
lower bits to control flags, and so the object it points to must be
sufficiently aligned to avoid the need to use those bits for pointing to
things.
Currently, 4 control bits and 4 colour bits are used in normal
circumstances, plus a debugging bit if debugging is set. This requires
the cpu_workqueue_struct struct to be at least 256 bytes aligned (or 512
bytes aligned with debugging).
PERCPU() alignment on MN13000, however, is only 32 bytes as set in
vmlinux.lds.S. So we set this to PAGE_SIZE (4096) to match most other
arches and stick a comment in alloc_cwqs() for anyone else who triggers
the assertion.
Reported-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mark Salter <msalter@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove text_mutex locking in optimize_all_kprobes, because
this function doesn't modify text. It simply queues probes on
optimization list for kprobe_optimizer worker thread.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101025131801.19160.70939.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove some sched_info_queue(), sched_info_dequeue() code comment.
We no longer belongs to the era of O(1) and we don't use active or expired
array anymore.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <AANLkTi=REu0WzOp5N=nVT1=ZJ=ZA+MZFV+4CHSJ3Q-Yv@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew Morton pointed out almost all sched_setscheduler() callers are
using fixed parameters and can be converted to static. It reduces runtime
memory use a little.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'softirq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
softirqs: Make wakeup_softirqd static
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, asm: Restore parentheses around one pushl_cfi argument
x86, asm: Fix ancient-GAS workaround
x86, asm: Fix CFI macro invocations to deal with shortcomings in gas
* 'x86-numa-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, numa: Assign CPUs to nodes in round-robin manner on fake NUMA
* 'x86-quirks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: HPET force enable for CX700 / VIA Epia LT
* 'x86-setup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, setup: Use string copy operation to optimze copy in kernel compression
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, UV: Use allocated buffer in tlb_uv.c:tunables_read()
* 'x86-vm86-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, vm86: Fix preemption bug for int1 debug and int3 breakpoint handlers.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kdb,debug_core: adjust master cpu switch logic against new debug_core locking
debug_core: refactor locking for master/slave cpus
x86,kgdb: remove unnecessary call to kgdb_correct_hw_break()
debug_core: disable hw_breakpoints on all cores in kgdb_cpu_enter()
kdb,kgdb: fix sparse fixups
kdb: Fix oops in kdb_unregister
kdb,ftdump: Remove reference to internal kdb include
kdb: Allow kernel loadable modules to add kdb shell functions
debug_core: stop rcu warnings on kernel resume
debug_core: move all watch dog syncs to a single function
x86,kgdb: fix debugger hw breakpoint test regression in 2.6.35
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove in_workqueue_context()
workqueue: Clarify that schedule_on_each_cpu is synchronous
memory_hotplug: drop spurious calls to flush_scheduled_work()
shpchp: update workqueue usage
pciehp: update workqueue usage
isdn/eicon: don't call flush_scheduled_work() from diva_os_remove_soft_isr()
workqueue: add and use WQ_MEM_RECLAIM flag
workqueue: fix HIGHPRI handling in keep_working()
workqueue: add queue_work and activate_work trace points
workqueue: prepare for more tracepoints
workqueue: implement flush[_delayed]_work_sync()
workqueue: factor out start_flush_work()
workqueue: cleanup flush/cancel functions
workqueue: implement alloc_ordered_workqueue()
Fix up trivial conflict in fs/gfs2/main.c as per Tejun
The kdb shell needs to enforce switching back to the original CPU that
took the exception before restoring normal kernel execution. Resuming
from a different CPU than what took the original exception will cause
problems with spin locks that are freed from the a different processor
than had taken the lock.
The special logic in dbg_cpu_switch() can go away entirely with
because the state of what cpus want to be masters or slaves will
remain unchanged between entry and exit of the debug_core exception
context.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
For quite some time there have been problems with memory barriers and
various races with NMI on multi processor systems using the kernel
debugger. The algorithm for entering the kernel debug core and
resuming kernel execution was racy and had several known edge case
problems with attempting to debug something on a heavily loaded system
using breakpoints that are hit repeatedly and quickly.
The prior "locking" design entry worked as follows:
* The atomic counter kgdb_active was used with atomic exchange in
order to elect a master cpu out of all the cpus that may have
taken a debug exception.
* The master cpu increments all elements of passive_cpu_wait[].
* The master cpu issues the round up cpus message.
* Each "slave cpu" that enters the debug core increments its own
element in cpu_in_kgdb[].
* Each "slave cpu" spins on passive_cpu_wait[] until it becomes 0.
* The master cpu debugs the system.
The new scheme removes the two arrays of atomic counters and replaces
them with 2 single counters. One counter is used to count the number
of cpus waiting to become a master cpu (because one or more hit an
exception). The second counter is use to indicate how many cpus have
entered as slave cpus.
The new entry logic works as follows:
* One or more cpus enters via kgdb_handle_exception() and increments
the masters_in_kgdb. Each cpu attempts to get the spin lock called
dbg_master_lock.
* The master cpu sets kgdb_active to the current cpu.
* The master cpu takes the spinlock dbg_slave_lock.
* The master cpu asks to round up all the other cpus.
* Each slave cpu that is not already in kgdb_handle_exception()
will enter and increment slaves_in_kgdb. Each slave will now spin
try_locking on dbg_slave_lock.
* The master cpu waits for the sum of masters_in_kgdb and slaves_in_kgdb
to be equal to the sum of the online cpus.
* The master cpu debugs the system.
In the new design the kgdb_active can only be changed while holding
dbg_master_lock. Stress testing has not turned up any further
entry/exit races that existed in the prior locking design. The prior
locking design suffered from atomic variables not being truly atomic
(in the capacity as used by kgdb) along with memory barrier races.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Dongdong Deng <dongdong.deng@windriver.com>
The slave cpus do not have the hw breakpoints disabled upon entry to
the debug_core and as a result could cause unrecoverable recursive
faults on badly placed breakpoints, or get out of sync with the arch
specific hw breakpoint operations.
This patch addresses the problem by invoking kgdb_disable_hw_debug()
earlier in kgdb_enter_cpu for each cpu that enters the debug core.
The hw breakpoint dis/enable flow should be:
master_debug_cpu slave_debug_cpu
\ /
kgdb_cpu_enter
|
kgdb_disable_hw_debug --> uninstall pre-enabled hw_breakpoint
|
do add/rm dis/enable operates to hw_breakpoints on master_debug_cpu..
|
correct_hw_break --> correct/install the enabled hw_breakpoint
|
leave_kgdb
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Fix the following sparse warnings:
kdb_main.c:328:5: warning: symbol 'kdbgetu64arg' was not declared. Should it be static?
kgdboc.c:246:12: warning: symbol 'kgdboc_early_init' was not declared. Should it be static?
kgdb.c:652:26: warning: incorrect type in argument 1 (different address spaces)
kgdb.c:652:26: expected void const *ptr
kgdb.c:652:26: got struct perf_event *[noderef] <asn:3>*pev
The one in kgdb.c required the (void * __force) because of the return
code from register_wide_hw_breakpoint looking like:
return (void __percpu __force *)ERR_PTR(err);
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Nothing should try to use kdb_commands directly as sometimes it is
null. Instead, use the for_each_kdbcmd() iterator.
This particular problem dates back to the initial kdb merge (2.6.35),
but at that point nothing was dynamically unregistering commands from
the kdb shell.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Now that include/linux/kdb.h properly exports all the functions
required to dynamically add a kdb shell command, the reference to the
private kdb header can be removed.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
In order to allow kernel modules to dynamically add a command to the
kdb shell the kdb_register, kdb_register_repeat, kdb_unregister, and
kdb_printf need to be exported as GPL symbols.
Any kernel module that adds a dynamic kdb shell function should only
need to include linux/kdb.h.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
When returning from the kernel debugger reset the rcu jiffies_stall
value to prevent the rcu stall detector from sending NMI events which
invoke a stack dump for each cpu in the system.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Move the various clock and watch dog syncs to a single function in
advance of adding another sync for the rcu stall detector.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
BKL: remove BKL from freevxfs
BKL: remove BKL from qnx4
autofs4: Only declare function when CONFIG_COMPAT is defined
autofs: Only declare function when CONFIG_COMPAT is defined
ncpfs: Lock socket in ncpfs while setting its callbacks
fs/locks.c: prepare for BKL removal
BKL: Remove BKL from ncpfs
BKL: Remove BKL from OCFS2
BKL: Remove BKL from squashfs
BKL: Remove BKL from jffs2
BKL: Remove BKL from ecryptfs
BKL: Remove BKL from afs
BKL: Remove BKL from USB gadgetfs
BKL: Remove BKL from autofs4
BKL: Remove BKL from isofs
BKL: Remove BKL from fat
BKL: Remove BKL from ext2 filesystem
BKL: Remove BKL from do_new_mount()
BKL: Remove BKL from cgroup
BKL: Remove BKL from NTFS
...
* 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
BKL: introduce CONFIG_BKL.
dabusb: remove the BKL
sunrpc: remove the big kernel lock
init/main.c: remove BKL notations
blktrace: remove the big kernel lock
rtmutex-tester: make it build without BKL
dvb-core: kill the big kernel lock
dvb/bt8xx: kill the big kernel lock
tlclk: remove big kernel lock
fix rawctl compat ioctls breakage on amd64 and itanic
uml: kill big kernel lock
parisc: remove big kernel lock
cris: autoconvert trivial BKL users
alpha: kill big kernel lock
isapnp: BKL removal
s390/block: kill the big kernel lock
hpet: kill BKL, add compat_ioctl
The recent changes in the genirq core unearthed a bug in arch/um which
called enable_irq() before the interrupt was set up.
Warn and return instead of crashing the machine with a NULL pointer
dereference.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Weinberger <richard@nod.at>
Add explict warning when del_timer_sync() is called in hardirq
context.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Actually we have used del_timer_sync() in softirq context for a long time,
e.g. in __dst_free()::cancel_delayed_work().
So change the comments of it to warn on hardirq context only, and make
lockdep know about this change.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On UP try_to_del_timer_sync() is mapped to del_timer() which does not
take the running timer callback into account, so it has different
semantics.
Remove the SMP dependency of try_to_del_timer_sync() by using
base->running_timer in the UP case as well.
[ tglx: Removed set_running_timer() inline and tweaked the changelog ]
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This new version (see commit 8e5fc1a) is much simpler and ensures that
in case of error in group_sched_in() during event_sched_in(), the
events up to the failed event go through regular event_sched_out().
But the failed event and the remaining events in the group have their
timings adjusted as if they had also gone through event_sched_in() and
event_sched_out(). This ensures timing uniformity across all events in
a group. This also takes care of the tstamp_stopped problem in case
the group could never be scheduled. The tstamp_stopped is updated as
if the event had actually run.
With this patch, the following now reports correct time_enabled,
in case the NMI watchdog is active:
$ task -e unhalted_core_cycles,instructions_retired,baclears,baclears
noploop 1
noploop for 1 seconds
0 unhalted_core_cycles (100.00% scaling, ena=997,552,872, run=0)
0 instructions_retired (100.00% scaling, ena=997,552,872, run=0)
0 baclears (100.00% scaling, ena=997,552,872, run=0)
0 baclears (100.00% scaling, ena=997,552,872, run=0)
And the older test case also works:
$ task -einstructions_retired,baclears,baclears -e
unhalted_core_cycles,baclears,baclears sleep 5
1680885 instructions_retired (69.39% scaling, ena=950756, run=291006)
10735 baclears (69.39% scaling, ena=950756, run=291006)
10735 baclears (69.39% scaling, ena=950756, run=291006)
0 unhalted_core_cycles (100.00% scaling, ena=817932, run=0)
0 baclears (100.00% scaling, ena=817932, run=0)
0 baclears (100.00% scaling, ena=817932, run=0)
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4cbeeebc.8ee7d80a.5a28.0d5f@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch reverts commit 8e5fc1a (perf_events: Fix transaction
recovery in group_sched_in()) because it had one flaw in case the
group could never be scheduled. It would cause time_enabled to get
negative.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4cbeeeb7.0aefd80a.6e40.0e2f@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dima noticed that we fail to correct the ->vruntime of sleeping tasks
when we move them between cgroups.
Reported-by: Dima Zavin <dima@android.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Mike Galbraith <efault@gmx.de>
LKML-Reference: <1287150604.29097.1513.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (71 commits)
powerpc/44x: Update ppc44x_defconfig
powerpc/watchdog: Make default timeout for Book-E watchdog a Kconfig option
fsl_rio: Add comments for sRIO registers.
powerpc/fsl-booke: Add e55xx (64-bit) smp defconfig
powerpc/fsl-booke: Add p5020 DS board support
powerpc/fsl-booke64: Use TLB CAMs to cover linear mapping on FSL 64-bit chips
powerpc/fsl-booke: Add support for FSL Arch v1.0 MMU in setup_page_sizes
powerpc/fsl-booke: Add support for FSL 64-bit e5500 core
powerpc/85xx: add cache-sram support
powerpc/85xx: add ngPIXIS FPGA device tree node to the P1022DS board
powerpc: Fix compile error with paca code on ppc64e
powerpc/fsl-booke: Add p3041 DS board support
oprofile/fsl emb: Don't set MSR[PMM] until after clearing the interrupt.
powerpc/fsl-booke: Add PCI device ids for P2040/P3041/P5010/P5020 QoirQ chips
powerpc/mpc8xxx_gpio: Add support for 'qoriq-gpio' controllers
powerpc/fsl_booke: Add support to boot from core other than 0
powerpc/p1022: Add probing for individual DMA channels
powerpc/fsl_soc: Search all global-utilities nodes for rstccr
powerpc: Fix invalid page flags in create TLB CAM path for PTE_64BIT
powerpc/mpc83xx: Support for MPC8308 P1M board
...
Fix up conflict with the generic irq_work changes in arch/powerpc/kernel/time.c
* 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (74 commits)
x86-64: Only set max_pfn_mapped to 512 MiB if we enter via head_64.S
xen: Cope with unmapped pages when initializing kernel pagetable
memblock, bootmem: Round pfn properly for memory and reserved regions
memblock: Annotate memblock functions with __init_memblock
memblock: Allow memblock_init to be called early
memblock/arm: Fix memblock_region_is_memory() typo
x86, memblock: Remove __memblock_x86_find_in_range_size()
memblock: Fix wraparound in find_region()
x86-32, memblock: Make add_highpages honor early reserved ranges
x86, memblock: Fix crashkernel allocation
arm, memblock: Fix the sparsemem build
memblock: Fix section mismatch warnings
powerpc, memblock: Fix memblock API change fallout
memblock, microblaze: Fix memblock API change fallout
x86: Remove old bootmem code
x86, memblock: Use memblock_memory_size()/memblock_free_memory_size() to get correct dma_reserve
x86: Remove not used early_res code
x86, memblock: Replace e820_/_early string with memblock_
x86: Use memblock to replace early_res
x86, memblock: Use memblock_debug to control debug message print out
...
Fix up trivial conflicts in arch/x86/kernel/setup.c and kernel/Makefile
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (278 commits)
arm: remove machine_desc.io_pg_offst and .phys_io
arm: use addruart macro to establish debug mappings
arm: return both physical and virtual addresses from addruart
arm/debug: consolidate addruart macros for CONFIG_DEBUG_ICEDCC
ARM: make struct machine_desc definition coherent with its comment
eukrea_mbimxsd-baseboard: Pass the correct GPIO to gpio_free
cpuimx27: fix compile when ULPI is selected
mach-pcm037_eet: fix compile errors
Fixing ethernet driver compilation error for i.MX31 ADS board
cpuimx51: update board support
mx5: add cpuimx51sd module and its baseboard
iomux-mx51: fix GPIO_1_xx 's IOMUX configuration
imx-esdhc: update devices registration
mx51: add resources for SD/MMC on i.MX51
iomux-mx51: fix SD1 and SD2's iomux configuration
clock-mx51: rename CLOCK1 to CLOCK_CCGR for better readability
clock-mx51: factorize clk_set_parent and clk_get_rate
eukrea_mbimxsd: add support for DVI displays
cpuimx25 & cpuimx35: fix OTG port registration in host mode
i.MX31 and i.MX35 : fix errate TLSbo65953 and ENGcm09472
...
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (96 commits)
apic, x86: Use BIOS settings for IBS and MCE threshold interrupt LVT offsets
apic, x86: Check if EILVT APIC registers are available (AMD only)
x86: ioapic: Call free_irte only if interrupt remapping enabled
arm: Use ARCH_IRQ_INIT_FLAGS
genirq, ARM: Fix boot on ARM platforms
genirq: Fix CONFIG_GENIRQ_NO_DEPRECATED=y build
x86: Switch sparse_irq allocations to GFP_KERNEL
genirq: Switch sparse_irq allocator to GFP_KERNEL
genirq: Make sparse_lock a mutex
x86: lguest: Use new irq allocator
genirq: Remove the now unused sparse irq leftovers
genirq: Sanitize dynamic irq handling
genirq: Remove arch_init_chip_data()
x86: xen: Sanitise sparse_irq handling
x86: Use sane enumeration
x86: uv: Clean up the direct access to irq_desc
x86: Make io_apic.c local functions static
genirq: Remove irq_2_iommu
x86: Speed up the irq_remapped check in hot pathes
intr_remap: Simplify the code further
...
Fix up trivial conflicts in arch/x86/Kconfig
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
sched: Export account_system_vtime()
sched: Call tick_check_idle before __irq_enter
sched: Remove irq time from available CPU power
sched: Do not account irq time to current task
x86: Add IRQ_TIME_ACCOUNTING
sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
sched: Add a PF flag for ksoftirqd identification
sched: Consolidate account_system_vtime extern declaration
sched: Fix softirq time accounting
sched: Drop group_capacity to 1 only if local group has extra capacity
sched: Force balancing on newidle balance if local group has capacity
sched: Set group_imb only a task can be pulled from the busiest cpu
sched: Do not consider SCHED_IDLE tasks to be cache hot
sched: Drop all load weight manipulation for RT tasks
sched: Create special class for stop/migrate work
sched: Unindent labels
sched: Comment updates: fix default latency and granularity numbers
tracing/sched: Add sched_pi_setprio tracepoint
sched: Give CPU bound RT tasks preference
sched: Try not to migrate higher priority RT tasks
...
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Check the depth of subclass
lockdep: Add improved subclass caching
affs: Use sema_init instead of init_MUTEX
hfs: Convert tree_lock to mutex
arm: Bcmring: semaphore cleanup
printk: Make console_sem a semaphore not a pseudo mutex
drivers/macintosh/adb: Do not claim that the semaphore is a mutex
parport: Semaphore cleanup
irda: Semaphore cleanup
net: Wan/cosa.c: Convert "mutex" to semaphore
net: Ppp_async: semaphore cleanup
hamradio: Mkiss: semaphore cleanup
hamradio: 6pack: semaphore cleanup
net: 3c527: semaphore cleanup
input: Serio/hp_sdc: semaphore cleanup
input: Serio/hil_mlc: semaphore cleanup
input: Misc/hp_sdc_rtc: semaphore cleanup
lockup_detector: Make callback function static
lockup detector: Fix grammar by adding a missing "to" in the comments
lockdep: Remove __debug_show_held_locks
lock_timer() conditionally grabs it_lock in case of returning non-NULL
but unlock_timer() releases it unconditionally. This leads sparse to
complain about the lock context imbalance. Rename and wrap lock_timer
using __cond_lock() macro to make sparse happy.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently, you have to just define a delayed_work uninitialised, and then
initialise it before first use. That's a tad clumsy. At risk of playing
mind-games with the compiler, fooling it into doing pointer arithmetic
with compile-time-constants, this lets clients properly initialise delayed
work with deferrable timers statically.
This patch was inspired by the issues which lead Artem Bityutskiy to
commit 8eab945c56 ("sunrpc: make the cache cleaner workqueue
deferrable").
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Replace sizeof(buffer)/sizeof(buffer[0]) with ARRAY_SIZE(buffer) in
kernel/time/timecompare.c
Signed-off-by: Nikitas Angelinas <nikitasangelinas@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When the clocksource is not a multiple of HZ, the clock will be off. For
acpi_pm, HZ=1000 the error is 127.111 ppm:
The rounding of cycle_interval ends up generating a false error term in
ntp_error accumulation since xtime_interval is not exactly 1/HZ. So, we
subtract out the error caused by the rounding.
This has been visible since 2.6.32-rc2
commit a092ff0f90
time: Implement logarithmic time accumulation
That commit raised NTP_INTERVAL_FREQ and exposed the rounding error.
testing tool: http://n1.taur.dk/permanent/testpmt.c
Also tested with ntpd and a frequency counter.
Signed-off-by: Kasper Pedersen <kkp2010@kasperkp.dk>
Acked-by: john stultz <johnstul@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the addition of trace_softirq_raise() the softirq tracepoint got
even more convoluted. Why the tracepoints take two pointers to assign
an integer is beyond my comprehension.
But adding an extra case which treats the first pointer as an unsigned
long when the second pointer is NULL including the back and forth
type casting is just horrible.
Convert the softirq tracepoints to take a single unsigned int argument
for the softirq vector number and fix the call sites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <alpine.LFD.2.00.1010191428560.6815@localhost6.localdomain6>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: mathieu.desnoyers@efficios.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
The tracing per_cpu buffers were limited to 999 CPUs for a mear
savings in stack space of a char array. Up the array to 30 characters
which is more than enough to hold a 64 bit number.
Reported-by: Robin Holt <holt@sgi.com>
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All security modules shouldn't change sched_param parameter of
security_task_setscheduler(). This is not only meaningless, but also
make a harmful result if caller pass a static variable.
This patch remove policy and sched_param parameter from
security_task_setscheduler() becuase none of security module is
using it.
Cc: James Morris <jmorris@namei.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: James Morris <jmorris@namei.org>
With the binding of time extends to events we no longer need to use
the macro RB_TIMESTAMPS_PER_PAGE. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a condition to check if we should add a time extend or
not in the fast path. But this condition is racey (in the sense
that we can add a unnecessary time extend, but nothing that
can break anything). We later check if the time or event time
delta should be zero or have real data in it (not racey), making
this first check redundant.
This check may help save space once in a while, but really is
not worth the hassle to try to save some space that happens at
most 134 ms at a time.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the time between two timestamps is greater than
2^27 nanosecs (~134 ms) a time extend event is added that extends
the time difference to 59 bits (~18 years). This is due to
events only having a 27 bit field to store time.
Currently this time extend is a separate event. We add it just before
the event data that is being written to the buffer. But before
the event data is committed, the event data can also be discarded (as
with the case of filters). But because the time extend has already been
committed, it will stay in the buffer.
If lots of events are being filtered and no event is being
written, then every 134ms a time extend can be added to the buffer
without any data attached. To keep from filling the entire buffer
with time extends, a time extend will never be the first event
in a page because the page timestamp can be used. Time extends can
only fill the rest of a page with some data at the beginning.
This patch binds the time extend with the data. The difference here
is that the time extend is not committed before the data is added.
Instead, when a time extend is needed, the space reserved on
the ring buffer is the time extend + the data event size. The
time extend is added to the first part of the reserved block and
the data is added to the second. The time extend event is passed
back to the reserver, but since the reserver also uses a function
to find the data portion of the reserved block, no changes to the
ring buffer interface need to be made.
When a commit is discarded, we now remove both the time extend and
the event. With this approach no more than one time extend can
be in the buffer in a row. Data must always follow a time extend.
Thanks to Mathieu Desnoyers for suggesting this idea.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The delta between events is passed to the timestamp code by reference
and the timestamp code will reset the value. But it can be reset
from the caller. No need to pass it in by reference.
By changing the call to pass by value, lets gcc optimize the code
a bit more where it can store the delta in a register and not
worry about updating the reference.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original code for the ring buffer had locations that modified
the timestamp and that change was used by the callers. Now,
the timestamp is not reused by the callers and there is no reason
to pass it by reference.
By changing the call to pass by value, lets gcc optimize the code
a bit more where it can store the timestamp in a register and not
worry about updating the reference.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Gcc inlines the slow path of the ring buffer write which can
hurt performance. This patch simply forces the slow path function
rb_move_tail() to always be a function.
The ring_buffer_benchmark module with reader_disabled=1 shows that
this patch changes the time to record an event from 135 ns to
132 ns. (3 ns or 2.22% improvement)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function start_func_tracer() was incorrectly added in the
#ifdef CONFIG_FUNCTION_TRACER condition, but is still used even
when function tracing is not enabled.
The calls to register_ftrace_function() and register_ftrace_graph()
become nops (and their arguments are even ignored), thus there is
no reason to hide start_func_tracer() when function tracing is
not enabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
futex_wait() is leaking key references due to futex_wait_setup()
acquiring an additional reference via the queue_lock() routine. The
nested key ref-counting has been masking bugs and complicating code
analysis. queue_lock() is only called with a previously ref-counted
key, so remove the additional ref-counting from the queue_(un)lock()
functions.
Also futex_wait_requeue_pi() drops one key reference too many in
unqueue_me_pi(). Remove the key reference handling from
unqueue_me_pi(). This was paired with a queue_lock() in
futex_lock_pi(), so the count remains unchanged.
Document remaining nested key ref-counting sites.
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Reported-and-tested-by: Matthieu Fertré<matthieu.fertre@kerlabs.com>
Reported-by: Louis Rilling<louis.rilling@kerlabs.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <4CBB17A8.70401@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
According to Jens, this code does not need the BKL at all,
it is sufficiently serialized by bd_mutex.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
The big kernel lock is going away, so make sure
that if it is disabled by Kconfig, we do not
try to validate it, which would result in
compile errors.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Commit a25909a4 (lockdep: Add an in_workqueue_context() lockdep-based
test function) added in_workqueue_context() but there hasn't been any
in-kernel user and the lockdep annotation in workqueue is scheduled to
change. Remove the unused function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The documentation for schedule_on_each_cpu() states that it calls a
function on each online CPU from keventd. This can easily be
interpreted as an asyncronous call because the description does not
mention that flush_work is called. Clarify that it is synchronous.
tj: rephrased a bit
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When CPU is idle and on first interrupt, irq_enter calls tick_check_idle()
to notify interruption from idle. But, there is a problem if this call
is done after __irq_enter, as all routines in __irq_enter may find
stale time due to yet to be done tick_check_idle.
Specifically, trace calls in __irq_enter when they use global clock and also
account_system_vtime change in this patch as it wants to use sched_clock_cpu()
to do proper irq timing.
But, tick_check_idle was moved after __irq_enter intentionally to
prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a:
irq: call __irq_enter() before calling the tick_idle_check
Impact: avoid spurious ksoftirqd wakeups
Moving tick_check_idle() before __irq_enter and wrapping it with
local_bh_enable/disable would solve both the problems.
Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The idea was suggested by Peter Zijlstra here:
http://marc.info/?l=linux-kernel&m=127476934517534&w=2
irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from CPU power piggybacking on
sched_rt_avg_update().
Tested this by keeping CPU X busy with a network intensive task having 75%
oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven
cycle soakers on the system. Without this change, there will be two tasks on
each CPU. With this change, there is a single task on irq busy CPU X and
remaining 7 tasks are spread around among other 3 CPUs.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Scheduler accounts both softirq and interrupt processing times to the
currently running task. This means, if the interrupt processing was
for some other task in the system, then the current task ends up being
penalized as it gets shorter runtime than otherwise.
Change sched task accounting to acoount only actual task time from
currently running task. Now update_curr(), modifies the delta_exec to
depend on rq->clock_task.
Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
for later.
This change will impact scheduling behavior in interrupt heavy conditions.
Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
spending 75%+ of its time in irq processing. CPU 3 spending around 35%
time running nc task.
Now, if I run another CPU intensive task on CPU 2, without this change
/proc/<pid>/schedstat shows 100% of time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task as remaining
time is actually spent on irq processing.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
the fine granularity accounting of user, system, hardirq, softirq times.
Adding that option on archs like x86 will be challenging however, given the
state of TSC reliability on various platforms and also the overhead it will
add in syscall entry exit.
Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer tick
based samples). This accounting is added with a new config option
CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
interested in paying the perf penalty.
This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.
This patch just adds the core logic and does not enable this logic yet.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To account softirq time cleanly in scheduler, we need to identify whether
softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
Add PF_KSOFTIRQD for that purpose.
As all PF flag bits are currently taken, create space by moving one of the
infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
with some other state fields.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:
http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
The problem is, softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate between whether softirq is
being processed or is it just that bh has been disabled. So, a hardirq when bh
is disabled results in time being wrongly accounted as softirq.
Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well. As account_system_time() in normal tick based accouting also uses
softirq_count, which will be set even when not in softirq with bh disabled.
Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
processing. The patch below does that and adds API in_serving_softirq() which
returns whether we are currently processing softirq or not.
Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.
Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
only if the local group has extra capacity. The extra check prevents the case
where you always pull from the heaviest group when it is already under-utilized
(possible with a large weight task outweighs the tasks on the system).
For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
and each task is running on one core. In this case, we observe the following
events when balancing at the NUMA domain:
- find_busiest_group() will always pick the sched group containing the niced
task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the
nice0 task (never picks the cpu with the nice -15 task since
weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task
and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5,
at which point it kicks off the active load balancer, wakes up the migration
thread and kicks the nice 0 task off the cpu.
The load balancer doesn't stop until we kick out all nice 0 tasks from
the sched group, leaving you with 3 idle cpus and one cpu running the
nice -15 task.
When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are
not relevant because the niced task has a very large weight.
In this patch, we add an extra condition to the "if(prefer_sibling)" check in
update_sd_lb_stats(). We drop the capacity of a group only if the local group
has extra capacity, ie. nr_running < group_capacity. This patch preserves the
original intent of the prefer_siblings check (to spread tasks across the system
in low utilization scenarios) and fixes the case above.
It helps in the following ways:
- In low utilization cases (where nr_tasks << nr_cpus), we still drop
group_capacity down to 1 if we prefer siblings.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
likely be > sgs.group_capacity.
- When balancing large weight tasks, if the local group does not have extra
capacity, we do not pick the group with the niced task as the busiest group.
This prevents failed balances, active migration and the under-utilization
described above.
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch forces a load balance on a newly idle cpu when the local group has
extra capacity and the busiest group does not have any. It improves system
utilization when balancing tasks with a large weight differential.
Under certain situations, such as a niced down task (i.e. nice = -15) in the
presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
kicks away other tasks because of its large weight. This leads to sub-optimal
utilization of the machine. Even though the sched group has capacity, it does
not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
With this patch, if the local group has extra capacity, we shortcut the checks
in f_b_g() and try to pull a task over. A sched group has extra capacity if the
group capacity is greater than the number of running tasks in that group.
Thanks to Mike Galbraith for discussions leading to this patch and for the
insight to reuse SD_NEWIDLE_BALANCE.
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When cycling through sched groups to determine the busiest group, set
group_imb only if the busiest cpu has more than 1 runnable task. This patch
fixes the case where two cpus in a group have one runnable task each, but there
is a large weight differential between these two tasks. The load balancer is
unable to migrate any task from this group, and hence do not consider this
group to be imbalanced.
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
[ small code readability edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a check in task_hot to return if the task has SCHED_IDLE
policy. SCHED_IDLE tasks have very low weight, and when run with regular
workloads, are typically scheduled many milliseconds apart. There is no
need to consider these tasks hot for load balancing.
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Trades a call + conditional + ret for an unconditional jmp.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101014203625.501657727@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that there's still only a few users around, rename things to make
them more consistent.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101014203625.448565169@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
hw_breakpoint creation needs to account stuff per-task to ensure there
is always sufficient hardware resources to back these things due to
ptrace.
With the perf per pmu context changes the event initialization no
longer has access to the event context, for the simple reason that we
need to first find the pmu (result of initialization) before we can
find the context.
This makes hw_breakpoints unhappy, because it can no longer do per
task accounting, cure this by frobbing a task pointer in the event::hw
bits for now...
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20101014203625.391543667@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So that we can pass the task pointer to the event allocation, so that
we can use task associated data during event initialization.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101014203625.340789919@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently it looks like find_lively_task_by_vpid() takes a task ref
and relies on find_get_context() to drop it.
The problem is that perf_event_create_kernel_counter() shouldn't be
dropping task refs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
LKML-Reference: <20101014203625.278436085@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Matt found we trigger the WARN_ON_ONCE() in perf_group_attach() when we take
the move_group path in perf_event_open().
Since we cannot de-construct the group (we rely on it to move the events), we
have to simply ignore the double attach. The group state is context invariant
and doesn't need changing.
Reported-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1287135757.29097.1368.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide a mechanism that allows running code in IRQ context. It is
most useful for NMI code that needs to interact with the rest of the
system -- like wakeup a task to drain buffers.
Perf currently has such a mechanism, so extract that and provide it as
a generic feature, independent of perf so that others may also
benefit.
The IRQ context callback is generated through self-IPIs where
possible, or on architectures like powerpc the decrementer (the
built-in timer facility) is set to generate an interrupt immediately.
Architectures that don't have anything like this get to do with a
callback from the timer tick. These architectures can call
irq_work_run() at the tail of any IRQ handlers that might enqueue such
work (like the perf IRQ handler) to avoid undue latencies in
processing the work.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
[ various fixes ]
Signed-off-by: Huang Ying <ying.huang@intel.com>
LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The group_sched_in() function uses a transactional approach to schedule
a group of events. In a group, either all events can be scheduled or
none are. To schedule each event in, the function calls event_sched_in().
In case of error, event_sched_out() is called on each event in the group.
The problem is that event_sched_out() does not completely cancel the
effects of event_sched_in(). Furthermore event_sched_out() changes the
state of the event as if it had run which is not true is this particular
case.
Those inconsistencies impact time tracking fields and may lead to events
in a group not all reporting the same time_enabled and time_running values.
This is demonstrated with the example below:
$ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
1946101 unhalted_core_cycles (32.85% scaling, ena=829181, run=556827)
11423 baclears (32.85% scaling, ena=829181, run=556827)
7671 baclears (0.00% scaling, ena=556827, run=556827)
2250443 unhalted_core_cycles (57.83% scaling, ena=962822, run=405995)
11705 baclears (57.83% scaling, ena=962822, run=405995)
11705 baclears (57.83% scaling, ena=962822, run=405995)
Notice that in the first group, the last baclears event does not
report the same timings as its siblings.
This issue comes from the fact that tstamp_stopped is updated
by event_sched_out() as if the event had actually run.
To solve the issue, we must ensure that, in case of error, there is
no change in the event state whatsoever. That means timings must
remain as they were when entering group_sched_in().
To do this we defer updating tstamp_running until we know the
transaction succeeded. Therefore, we have split event_sched_in()
in two parts separating the update to tstamp_running.
Similarly, in case of error, we do not want to update tstamp_stopped.
Therefore, we have split event_sched_out() in two parts separating
the update to tstamp_stopped.
With this patch, we now get the following output:
$ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
2492050 unhalted_core_cycles (71.75% scaling, ena=1093330, run=308841)
11243 baclears (71.75% scaling, ena=1093330, run=308841)
11243 baclears (71.75% scaling, ena=1093330, run=308841)
1852746 unhalted_core_cycles (0.00% scaling, ena=784489, run=784489)
9253 baclears (0.00% scaling, ena=784489, run=784489)
9253 baclears (0.00% scaling, ena=784489, run=784489)
Note that the uneven timing between groups is a side effect of
the process spending most of its time sleeping, i.e., not enough
event rotations (but that's a separate issue).
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4cb86b4c.41e9d80a.44e9.3e19@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
You can only call update_context_time() when the context
is active, i.e., the thread it is attached to is still running.
However, perf_event_read() can be called even when the context
is inactive, e.g., user read() the counters. The call to
update_context_time() must be conditioned on the status of
the context, otherwise, bogus time_enabled, time_running may
be returned. Here is an example on AMD64. The task program
is an example from libpfm4. The -p prints deltas every 1s.
$ task -p -e cpu_clk_unhalted sleep 5
2,266,610 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
5,242,358,071 cpu_clk_unhalted (99.95% scaling, ena=5,000,359,984, run=2,319,270)
Whereas if you don't read deltas, e.g., no call to perf_event_read() until
the process terminates:
$ task -e cpu_clk_unhalted sleep 5
2,497,783 cpu_clk_unhalted (0.00% scaling, ena=2,376,899, run=2,376,899)
Notice that time_enable, time_running are bogus in the first example
causing bogus scaling.
This patch fixes the problem, by conditionally calling update_context_time()
in perf_event_read().
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
LKML-Reference: <4cb856dc.51edd80a.5ae0.38fb@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Current look_up_lock_class() doesn't check the parameter "subclass".
This rarely rises problems because the main caller of this function,
register_lock_class(), checks it.
But register_lock_class() is not the only function which calls
look_up_lock_class(). lock_set_class() and its callees also call it.
And lock_set_class() doesn't check this parameter.
This will rise problems when the the value of subclass is larger than
MAX_LOCKDEP_SUBCLASSES. Because the address (used as the key of class)
caliculated with too large subclass has a probability to point
another key in different lock_class_key.
Of course this problem depends on the memory layout and
occurs with really low probability.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Vojtech Pavlik <vojtech@ucw.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286958626-986-1-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Current lockdep_map only caches one class with subclass == 0,
and looks up hash table of classes when subclass != 0.
It seems that this has no problem because the case of
subclass != 0 is rare. But locks of struct rq are
acquired with subclass == 1 when task migration is executed.
Task migration is high frequent event, so I modified lockdep
to cache subclasses.
I measured the score of perf bench sched messaging.
This patch has slightly but certain (order of milli seconds
or 10 milli seconds) effect when lots of tasks are running.
I'll show the result in the tail of this description.
NR_LOCKDEP_CACHING_CLASSES specifies how many classes can be
cached in the instances of lockdep_map.
I discussed with Peter Zijlstra in LinuxCon Japan about
this approach and he taught me that caching every subclasses(8)
is cleary waste of memory. So number of cached classes
should be configurable.
=== Score comparison of benchmarks ===
# "min" means best score, and "max" means worst score
for i in `seq 1 10`; do ./perf bench -f simple sched messaging; done
before: min: 0.565000, max: 0.583000, avg: 0.572500
after: min: 0.559000, max: 0.568000, avg: 0.563300
# with more processes
for i in `seq 1 10`; do ./perf bench -f simple sched messaging -g 40; done
before: min: 2.274000, max: 2.298000, avg: 2.286300
after: min: 2.242000, max: 2.270000, avg: 2.259700
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286269311-28336-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Load weights are for the CFS, they do not belong in the RT task. This makes all
RT scheduling classes leave the CFS weights alone.
This fixes a real bug as well: I noticed the following phonomena: a process
elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
inserted by set_load_weight() remains at 0, giving the task insignificat
priority.
With this fix, the weight is reset to what the task had before being elevated
to SCHED_RR/SCHED_FIFO.
Cc: Lennart Poettering <lennart@poettering.net>
Cc: stable@kernel.org
Signed-off-by: Linus Walleij <linus.walleij@stericsson.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to separate the stop/migrate work thread from the SCHED_FIFO
implementation, create a special class for it that is of higher priority than
SCHED_FIFO itself.
This currently solves a problem where cpu-hotplug consumes so much cpu-time
that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
timer pending on the now dead cpu.
It is also required for when we add the planned deadline scheduling class above
SCHED_FIFO, as the stop/migrate thread still needs to transcent those tasks.
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1285165776.2275.1022.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Labels should be on column 0.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Even though the parent is recorded with the normal function tracing
of the latency tracers (irqsoff and wakeup), the function graph
recording is bogus.
This is due to the function graph messing with the return stack.
The latency tracers pass in as the parent CALLER_ADDR0, which
works fine for plain function tracing. But this causes bogus output
with the graph tracer:
3) <idle>-0 | d.s3. 0.000 us | return_to_handler();
3) <idle>-0 | d.s3. 0.000 us | _raw_spin_unlock_irqrestore();
3) <idle>-0 | d.s3. 0.000 us | return_to_handler();
3) <idle>-0 | d.s3. 0.000 us | trace_hardirqs_on();
The "return_to_handle()" call is the trampoline of the
function graph tracer, and is meaningless in this context.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The preempt and irqsoff tracers have three types of function tracers.
Normal function tracer, function graph entry, and function graph return.
Each of these use a complex dance to prevent recursion and whether
to trace the data or not (depending if interrupts are enabled or not).
This patch moves the duplicate code into a single routine, to
prevent future mistakes with modifying duplicate complex code.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The wakeup tracer has three types of function tracers. Normal
function tracer, function graph entry, and function graph return.
Each of these use a complex dance to prevent recursion and whether
to trace the data or not (depending on the wake_task variable).
This patch moves the duplicate code into a single routine, to
prevent future mistakes with modifying duplicate complex code.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add function graph support for wakeup latency tracer.
The graph output is enabled by setting the 'display-graph'
trace option.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1285243253-7372-4-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move trace_graph_function() and print_graph_headers_flags() functions
to the trace_function_graph.c to be globaly available.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1285243253-7372-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The check_irq_entry and check_irq_return could be called
from graph event context. In such case there's no graph
private data allocated. Adding checks to handle this case.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20100924154102.GB1818@jolsa.brq.redhat.com>
[ Fixed some grammar in the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
SoCs have a standard set of tuples consisting of frequency and
voltage pairs that the device will support per voltage domain. These
are called Operating Performance Points or OPPs. The actual
definitions of OPP varies over silicon versions. For a specific domain,
we can have a set of {frequency, voltage} pairs. As the kernel boots
and more information is available, a default set of these are activated
based on the precise nature of device. Further on operation, based on
conditions prevailing in the system (such as temperature), some OPP
availability may be temporarily controlled by the SoC frameworks.
To implement an OPP, some sort of power management support is necessary
hence this library depends on CONFIG_PM.
Contributions include:
Sanjeev Premi for the initial concept:
http://patchwork.kernel.org/patch/50998/
Kevin Hilman for converting original design to device-based.
Kevin Hilman and Paul Walmsey for cleaning up many of the function
abstractions, improvements and data structure handling.
Romit Dasgupta for using enums instead of opp pointers.
Thara Gopinath, Eduardo Valentin and Vishwanath BS for fixes and
cleanups.
Linus Walleij for recommending this layer be made generic for usage
in other architectures beyond OMAP and ARM.
Mark Brown, Andrew Morton, Rafael J. Wysocki, Paul E. McKenney for
valuable improvements.
Discussions and comments from:
http://marc.info/?l=linux-omap&m=126033945313269&w=2http://marc.info/?l=linux-omap&m=125482970102327&w=2http://marc.info/?t=125809247500002&r=1&w=2http://marc.info/?l=linux-omap&m=126025973426007&w=2http://marc.info/?t=128152609200064&r=1&w=2http://marc.info/?t=128468723000002&r=1&w=2
incorporated.
v1: http://marc.info/?t=128468723000002&r=1&w=2
Signed-off-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If the device which fails to resume is part of a loadable kernel module
it won't be checked at startup against the magic number stored in the
RTC.
Add a read-only sysfs attribute /sys/power/pm_trace_dev_match which
contains a list of newline separated devices (usually just the one)
which currently match the last magic number. This allows the device
which is failing to resume to be found after the modules are loaded
again.
Signed-off-by: James Hogan <james@albanarts.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Since we are adding compression to the kernel's hibernate code,
change signature used by it to mark swap spaces, so that earlier
kernels don't attempt to restore compressed images they cannot
handle.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
If there is a wakeup event during the freezing of tasks, suspend or
hibernation will fail anyway. Since try_to_freeze_tasks() can take
up to 20 seconds to complete or fail, aborting it as soon as a wakeup
event is detected improves the worst case wakeup latency.
Based on a patch from Arve Hjønnevåg.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
The hibernate resume code checks if there is an image to resume from
on every boot and, if the kernel is built with CONFIG_PM_DEBUG set
and the image is not present, it prints some scary messages
suggesting there was a boot error of some sort. Apparently, some
users are confused by them, so make them look less scary and adjust
the other hibernate resume debug messages to match them.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Introduce struct wakeup_source for representing system wakeup sources
within the kernel and for collecting statistics related to them.
Make the recently introduced helper functions pm_wakeup_event(),
pm_stay_awake() and pm_relax() use struct wakeup_source objects
internally, so that wakeup statistics associated with wakeup devices
can be collected and reported in a consistent way (the definition of
pm_relax() is changed, which is harmless, because this function is
not called directly by anyone yet). Introduce new wakeup-related
sysfs device attributes in /sys/devices/.../power for reporting the
device wakeup statistics.
Change the global wakeup events counters event_count and
events_in_progress into atomic variables, so that it is not necessary
to acquire a global spinlock in pm_wakeup_event(), pm_stay_awake()
and pm_relax(), which should allow us to avoid lock contention in
these functions on SMP systems with many wakeup devices.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
The default hibernation image size is currently hard coded and euqal
to 500 MB, which is not a reasonable default on many contemporary
systems. Make it equal 2/5 of the total RAM size (this is slightly
below the maximum, i.e. 1/2 of the total RAM size, and seems to be
generally suitable).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: M. Vefa Bicakci <bicave@superonline.com>
One comment in hibernate_preallocate_memory() is wrong, so fix it and
add one more comment to clarify the meaning of the fixed one.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Although we need the PM workqueue to be freezable, we don't need it
to be singlethread. Also, the number of concurrent work items
running on a single CPU need not be constrained. For these reasons
use alloc_workqueue() directly, with suitable arguments, instead of
create_freezeable_workqueue(), to create the runtime PM workqueue.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Tejun Heo <tj@kernel.org>
Fix the following build warning:
warning: (PM_SLEEP_SMP && SMP && (ARCH_SUSPEND_POSSIBLE || \
ARCH_HIBERNATION_POSSIBLE) && PM_SLEEP) selects HOTPLUG_CPU which \
has unmet direct dependencies (SMP && HOTPLUG)
by selecting HOTPLUG along with CPU_HOTPLUG.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Compress hibernation image with LZO in order to save on I/O and
therefore time to hibernate/thaw.
[rjw: Added hibernate=nocompress command line option instead of just
nocompress which would be confusing, fixed a couple of compiler
warnings, fixed kerneldoc comments, minor cleanups.]
Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
sysctl check complains with a WARN() when proc_doulongvec_minmax() or
proc_doulongvec_ms_jiffies_minmax() are used by a vector of longs (with
more than one element), with no min or max value specified.
This is unexpected, given we had a bug on this min/max handling :)
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
The config option used by archs to let the build system know that
the C version of the recordmcount works for said arch is currently
called HAVE_C_MCOUNT_RECORD which enables BUILD_C_RECORDMCOUNT. To
be more consistent with the name that all archs may use, it has been
renamed to HAVE_C_RECORDMCOUNT. This will be less confusing since
we are building a C recordmcount and not a mcount_record.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: linux-kbuild@vger.kernel.org
Cc: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds the support for the C version of recordmcount and
compile times show ~ 12% improvement.
After verifying this works, other archs can add:
HAVE_C_MCOUNT_RECORD
in its Kconfig and it will use the C version of recordmcount
instead of the perl version.
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: linux-kbuild@vger.kernel.org
Cc: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The race is described as follows:
CPU X CPU Y
remove_hrtimer
// state & QUEUED == 0
timer->state = CALLBACK
unlock timer base
timer->f(n) //very long
hrtimer_start
lock timer base
remove_hrtimer // no effect
hrtimer_enqueue
timer->state = CALLBACK |
QUEUED
unlock timer base
hrtimer_start
lock timer base
remove_hrtimer
mode = INACTIVE
// CALLBACK bit lost!
switch_hrtimer_base
CALLBACK bit not set:
timer->base
changes to a
different CPU.
lock this CPU's timer base
The bug was introduced with commit ca109491f (hrtimer: removing all ur
callback modes) in 2.6.29
[ tglx: Feed new state via local variable and add a comment. ]
Signed-off-by: Salman Qazi <sqazi@google.com>
Cc: akpm@linux-foundation.org
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20101012142351.8485.21823.stgit@dungbeetle.mtv.corp.google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
Targeted preemption latency and minimal preemption granularity
for CPU-bound tasks have been changed.
This patch updates the comments about these values.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20101014160913.eb24fef4.yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Convert futex_requeue() function parameters to use @name
kernel-doc notation and add @fshared & @cmpval to prevent
kernel-doc warnings.
Add @list to struct futex_q.
Fix a few typos.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20101013110234.89b06043.randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix selftest to clear flags field for reusing probes
because the flags field can be modified by Kprobes.
This also set NULL to kprobe.addr instead of 0.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20101014031024.4100.50107.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix
kernel/trace/trace_functions_graph.c: In function ‘trace_print_graph_duration’:
kernel/trace/trace_functions_graph.c:652: warning: comparison of distinct pointer types lacks a cast
when building 36-rc6 on a 32-bit due to the strict type check failing
in the min() macro.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Chase Douglas <chase.douglas@canonical.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20100929080823.GA13595@liondog.tnic>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
This option can be set to verify the full conversion to the new chip
functions. Fix the fallout of the patch rework, so the core code
compiles and works with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Time stamps for the ring buffer are created by the difference between
two events. Each page of the ring buffer holds a full 64 bit timestamp.
Each event has a 27 bit delta stamp from the last event. The unit of time
is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events
happen more than 134 milliseconds apart, a time extend is inserted
to add more bits for the delta. The time extend has 59 bits, which
is good for ~18 years.
Currently the time extend is committed separately from the event.
If an event is discarded before it is committed, due to filtering,
the time extend still exists. If all events are being filtered, then
after ~134 milliseconds a new time extend will be added to the buffer.
This can only happen till the end of the page. Since each page holds
a full timestamp, there is no reason to add a time extend to the
beginning of a page. Time extends can only fill a page that has actual
data at the beginning, so there is no fear that time extends will fill
more than a page without any data.
When reading an event, a loop is made to skip over time extends
since they are only used to maintain the time stamp and are never
given to the caller. As a paranoid check to prevent the loop running
forever, with the knowledge that time extends may only fill a page,
a check is made that tests the iteration of the loop, and if the
iteration is more than the number of time extends that can fit in a page
a warning is printed and the ring buffer is disabled (all of ftrace
is also disabled with it).
There is another event type that is called a TIMESTAMP which can
hold 64 bits of data in the theoretical case that two events happen
18 years apart. This code has not been implemented, but the name
of this event exists, as well as the structure for it. The
size of a TIMESTAMP is 16 bytes, where as a time extend is only
8 bytes. The macro used to calculate how many time extends can fit on
a page used the TIMESTAMP size instead of the time extend size
cutting the amount in half.
The following test case can easily trigger the warning since we only
need to have half the page filled with time extends to trigger the
warning:
# cd /sys/kernel/debug/tracing/
# echo function > current_tracer
# echo 'common_pid < 0' > events/ftrace/function/filter
# echo > trace
# echo 1 > trace_marker
# sleep 120
# cat trace
Enabling the function tracer and then setting the filter to only trace
functions where the process id is negative (no events), then clearing
the trace buffer to ensure that we have nothing in the buffer,
then write to trace_marker to add an event to the beginning of a page,
sleep for 2 minutes (only 35 seconds is probably needed, but this
guarantees the bug), and then finally reading the trace which will
trigger the bug.
This patch fixes the typo and prevents the false positive of that warning.
Reported-by: Hans J. Koch <hjk@linutronix.de>
Tested-by: Hans J. Koch <hjk@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It needs to be investigated whether it can be replaced by a real
mutex, but that needs more thought.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
LKML-Reference: <20100907125057.179587334@linutronix.de>
The allocator functions are now called outside of preempt disabled
regions. Switch to GFP_KERNEL.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
The move_irq_desc() function was only used due to the problem that the
allocator did not free the old descriptors. So the descriptors had to
be moved in create_irq_nr(). That's history.
The code would have never been able to move active interrupt
descriptors on affinity settings. That can be done in a completely
different way w/o all this horror.
Remove all of it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Use the cleanup functions of the dynamic allocator. No need to have
separate implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
This function should have not been there in the first place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
sparse irq sets up NR_IRQS_LEGACY irq descriptors and archs then go
ahead and allocate more.
Use the unused return value of arch_probe_nr_irqs() to let the
architecture return the number of early allocations. Fix up all users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Make irq_to_desc_alloc_node() a wrapper around the new allocator.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Mark a range of interrupts as allocated. In the SPARSE_IRQ=n case we
need this to update the bitmap for the legacy irqs so the enumerator
via irq_get_next_irq() works.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
/proc/irq never removes any entries, but when irq descriptors can be
freed for real this is necessary. Otherwise we'd reference a freed
descriptor in /proc/irq/N
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
The current sparse_irq allocator has several short comings due to
failures in the design or the lack of it:
- Requires iteration over the number of active irqs to find a free slot
(Some architectures have grown their own workarounds for this)
- Removal of entries is not possible
- Racy between create_irq_nr and destroy_irq (plugged by horrible
callbacks)
- Migration of active irq descriptors is not possible
- No bulk allocation of irq ranges
- Sprinkeled irq_desc references all over the place outside of kernel/irq/
(The previous chip functions series is addressing this issue)
Implement a sane allocator which fixes the above short comings (though
migration of active descriptors needs a full tree wide cleanup of the
direct and mostly unlocked access to irq_desc).
The new allocator still uses a radix_tree, but uses a bitmap for
keeping track of allocated irq numbers. That allows:
- Fast lookup of a free slot
- Allows the removal of descriptors
- Prevents the create/destroy race
- Bulk allocation of consecutive irq ranges
- Basic design is ready for migration of life descriptors after
further cleanups
The bitmap is also used in the SPARSE_IRQ=n case for lookup and
raceless (de)allocation of irq numbers. So it removes the requirement
for looping through the descriptor array to find slots.
Right now it uses sparse_irq_lock to protect the bitmap and the radix
tree, but after cleaning up all users we should be able convert that
to a mutex and to switch the radix_tree and decriptor allocations to
GFP_KERNEL.
[ Folded in a bugfix from Yinghai Lu ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Arch code sets it's own irq_desc.status flags right after boot and for
dynamically allocated interrupts. That might involve iterating over a
huge array.
Allow ARCH_IRQ_INIT_FLAGS to set separate flags aside of IRQ_DISABLED
which is the default.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
The statistics accessor is only used by proc/stats and
show_interrupts(). Both are compiled in.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
early_init_irq_lock_class() is called way before anything touches the
irq descriptors. In case of SPARSE_IRQ=y this is a NOP operation
because the radix tree is empty at this point. For the SPARSE_IRQ=n
case it's sufficient to set the lock class in early_init_irq().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
kernel/irq/handle.c has become a dumpground for random code in random
order. Split out the irq descriptor management and the dummy irq_chip
implementation into separate files. Cleanup the include maze while at
it.
No code change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Get the data structure from the core and provide inline wrappers to
access the irq_data members.
Provide accessor inlines for irq_data as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Provide a irq_desc.status modifier function to cleanup the direct
access to irq_desc in arch and driver code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
perf events: repair incorrect use of copy_from_user
This makes the perf_event_period() return 0 instead of
-EFAULT on success.
Signed-off-by: John Blackwood<john.blackwood@ccur.com>
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100928220311.GA18145@tsunami.ccur.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reason for merge:
Forward-port urgent change to arch/x86/mm/srat_64.c to the memblock tree.
Resolved Conflicts:
arch/x86/mm/srat_64.c
Originally-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Introduce perf_pmu_name() helper function that returns the name of the
pmu. This gives us a generic way to get the name of a pmu regardless of
how an architecture identifies it internally.
Signed-off-by: Matt Fleming <matt@console-pimps.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Add WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark
WQ_RESCUER as internal and replace all external WQ_RESCUER usages to
WQ_MEM_RECLAIM.
This makes the API users express the intent of the workqueue instead
of indicating the internal mechanism used to guarantee forward
progress. This is also to make it cleaner to add more semantics to
WQ_MEM_RECLAIM. For example, if deemed necessary, memory reclaim
workqueues can be made highpri.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
The policy function keep_working() didn't check GCWQ_HIGHPRI_PENDING
and could return %false with highpri work pending. This could lead to
late execution of a highpri work which was delayed due to @max_active
throttling if other works are actively consuming CPU cycles.
For example, the following could happen.
1. Work W0 which burns CPU cycles.
2. Two works W1 and W2 are queued to a highpri wq w/ @max_active of 1.
3. W1 starts executing and W2 is put to delayed queue. W0 and W1 are
both runnable.
4. W1 finishes which puts W2 to pending queue but keep_working()
incorrectly returns %false and the worker goes to sleep.
5. W0 finishes and W2 starts execution.
With this patch applied, W2 starts execution as soon as W1 finishes.
Signed-off-by: Tejun Heo <tj@kernel.org>
* 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
HWPOISON: Stop shrinking at right page count
HWPOISON: Report correct address granuality for AO huge page errors
HWPOISON: Copy si_addr_lsb to user
page-types.c: fix name of unpoison interface
When proc_doulongvec_minmax() is used with an array of longs, and no
min/max check requested (.extra1 or .extra2 being NULL), we dereference a
NULL pointer for the second element of the array.
Noticed while doing some changes in network stack for the "16TB problem"
Fix is to not change min & max pointers in __do_proc_doulongvec_minmax(),
so that all elements of the vector share an unique min/max limit, like
proc_dointvec_minmax().
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an optional test to force long-term preemption of RCU read-side
critical sections, controlled by new test_boost, test_boost_interval,
and test_boost_duration module parameters. This is to be used to
test RCU priority boosting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This addresses the following RCU lockdep splat:
[0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
[0.052999] lockdep: fixing up alternatives.
[0.054105]
[0.054106] ===================================================
[0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
[0.054999] ---------------------------------------------------
[0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
[0.054999]
[0.054999] other info that might help us debug this:
[0.054999]
[0.054999]
[0.054999] rcu_scheduler_active = 1, debug_locks = 1
[0.054999] 3 locks held by swapper/1:
[0.054999] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
[0.054999] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
[0.054999] #2: (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
[0.054999]
[0.054999] stack backtrace:
[0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
[0.054999] Call Trace:
[0.054999] [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
[0.054999] [<ffffffff810325c3>] task_group+0x7b/0x8a
[0.054999] [<ffffffff810325e5>] set_task_rq+0x13/0x40
[0.054999] [<ffffffff814be39a>] init_idle+0xd2/0x113
[0.054999] [<ffffffff814be78a>] fork_idle+0xb8/0xc7
[0.054999] [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
[0.054999] [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
[0.054999] [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
[0.054999] [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
[0.054999] [<ffffffff814be876>] _cpu_up+0xac/0x127
[0.054999] [<ffffffff814be946>] cpu_up+0x55/0x6a
[0.054999] [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
[0.054999] [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
[0.054999] [<ffffffff814c353c>] ? restore_args+0x0/0x30
[0.054999] [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
[0.054999] [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
[0.056074] Booting Node 0, Processors #1lockdep: fixing up alternatives.
[0.130045] #2lockdep: fixing up alternatives.
[0.203089] #3 Ok.
[0.275286] Brought up 4 CPUs
[0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
The cgroup_subsys_state structures referenced by idle tasks are never
freed, because the idle tasks should be part of the root cgroup,
which is not removable.
The problem is that while we do in-fact hold rq->lock, the newly spawned
idle thread's cpu is not yet set to the correct cpu so the lockdep check
in task_group():
lockdep_is_held(&task_rq(p)->lock)
will fail.
But this is a chicken and egg problem. Setting the CPU's runqueue requires
that the CPU's runqueue already be set. ;-)
So insert an RCU read-side critical section to avoid the complaint.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value
due to the caller didn't hold the root_rcu/rnp node's lock. Although
use without ACCESS_ONCE() is safe due to the value loaded being used
but once, the ACCESS_ONCE() is a good documentation aid -- the variables
are being loaded without the services of a lock.
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
CC: Dipankar Sarma <dipankar@in.ibm.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ===================================================
> [ INFO: suspicious rcu_dereference_check() usage. ]
> ---------------------------------------------------
> /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 1, debug_locks = 1
> 1 lock held by ifup/23517:
> #0: (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108
>
> stack backtrace:
> Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
> Call Trace:
> [<c075e219>] ? printk+0xf/0x16
> [<c0455842>] lockdep_rcu_dereference+0x74/0x7d
> [<c0426854>] task_group+0x6d/0x79
> [<c042686e>] set_task_rq+0xe/0x57
> [<c042f79e>] task_fork_fair+0x57/0x108
> [<c042e965>] sched_fork+0x82/0xf9
> [<c04334b3>] copy_process+0x569/0xe8e
> [<c0433ef0>] do_fork+0x118/0x262
> [<c076302f>] ? do_page_fault+0x16a/0x2cf
> [<c044b80c>] ? up_read+0x16/0x2a
> [<c04085ae>] sys_clone+0x1b/0x20
> [<c04030a5>] ptregs_clone+0x15/0x30
> [<c0402f1c>] ? sysenter_do_call+0x12/0x38
Here a newly created task is having its runqueue assigned. The new task
is not yet on the tasklist, so cannot go away. This is therefore a false
positive, suppress with an RCU read-side critical section.
Reported-by: Ben Greear <greearb@candelatech.com
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Ben Greear <greearb@candelatech.com
The original hwpoison code added a new siginfo field si_addr_lsb to
pass the granuality of the fault address to user space. Unfortunately
this field was never copied to user space. Fix this here.
I added explicit checks for the MCEERR codes to avoid having
to patch all potential callers to initialize the field.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
As suggested by Linus, push the irqs_disabled() down to the
rcu_read_lock_bh_held() level so that all callers get the benefit
of the correct check.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
With all the recent module loading cleanups, we've minimized the code
that sits under module_mutex, fixing various deadlocks and making it
possible to do most of the module loading in parallel.
However, that whole conversion totally missed the rather obscure code
that adds a new module to the list for BUG() handling. That code was
doubly obscure because (a) the code itself lives in lib/bugs.c (for
dubious reasons) and (b) it gets called from the architecture-specific
"module_finalize()" rather than from generic code.
Calling it from arch-specific code makes no sense what-so-ever to begin
with, and is now actively wrong since that code isn't protected by the
module loading lock any more.
So this commit moves the "module_bug_{finalize,cleanup}()" calls away
from the arch-specific code, and into the generic code - and in the
process protects it with the module_mutex so that the list operations
are now safe.
Future fixups:
- move the module list handling code into kernel/module.c where it
belongs.
- get rid of 'module_bug_list' and just use the regular list of modules
(called 'modules' - imagine that) that we already create and maintain
for other reasons.
Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Adrian Bunk <bunk@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two tracepoints allow tracking when and how a work is queued and
activated. This patch is based on Frederic's patch to add queue_work
trace point.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Define workqueue_work event class and use it for workqueue_execute_end
trace point. Also, move trace/events/workqueue.h include downwards
such that all struct definitions are visible to it. This is to
prepare for more tracepoints and doesn't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
The BKL is only used in remount_fs and get_sb that are both protected by
the superblocks s_umount rw_semaphore. Therefore it is safe to remove the
BKL entirely.
Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.
I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.
do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.
Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.
[arnd: do not add the BKL to those file systems that already
don't use it elsewhere]
Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
This option covers now the old chip functions and the irq_desc data
fields which are moving to struct irq_data. More stuff will follow.
Pretty handy for testing a conversion, whether something broke or not.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
This patch fixes an error in perf_event_open() when the pid
provided by the user is invalid. find_lively_task_by_vpid()
does not return NULL on error but an error code. Without the
fix the error code was silently passed to find_get_context()
which would eventually cause a invalid pointer dereference.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: paulus@samba.org
Cc: davem@davemloft.net
Cc: fweisbec@gmail.com
Cc: perfmon2-devel@lists.sf.net
Cc: eranian@gmail.com
Cc: robert.richter@amd.com
LKML-Reference: <4ca9a5d1.e8e9d80a.3dbb.ffff8f2e@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function retrigger() until the migration is complete
and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121843.025801092@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function set_wake() until the migration is complete
and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.927527393@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function set_type() until the migration is complete
and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.832261548@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function set_affinity() until the migration is
complete and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.732894108@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function startup() until the migration is complete and
the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.635152961@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip functions disable() and shutdown() until the
migration is complete and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.532070631@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function enable() until the migration is complete and
the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.437159182@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function eoi() until the migration is complete and
the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.339657617@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function mask_ack() until the migration is complete
and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.240806983@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function ack() until the migration is complete and
the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.142624725@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function unmask() until the migration is complete
and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121842.043608928@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip function mask() until the migration is complete and
the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121841.940355859@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Wrap the old chip functions for bus_lock/bus_sync_unlock until the
migration is complete and the old chip functions are removed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121841.842536121@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
The compat functions go away when the core code is converted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Convert all references in the core code to orq, chip, handler_data,
chip_data, msi_desc, affinity to irq_data.*
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Low level chip functions need access to irq_desc->handler_data,
irq_desc->chip_data and irq_desc->msi_desc. We hand down the irq
number to the low level functions, so they need to lookup irq_desc.
With sparse irq this means a radix tree lookup.
We could hand down irq_desc itself, but low level chip functions have
no need to fiddle with it directly and we want to restrict access to
irq_desc further.
Preparatory patch for new chip functions.
Note, that the ugly anon union/struct is there to avoid a full tree
wide clean up for now. This is not going to last 3 years like __do_IRQ()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121841.645542300@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
The generic irq Kconfig options are copied around all archs. Provide a
generic Kconfig file which can be included.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20100927121843.217333624@linutronix.de>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
The kfifo_dma family of functions use sg_mark_end() on the last element in
their scatterlist. This forces use of a fresh scatterlist for each DMA
operation, which makes recycling a single scatterlist impossible.
Change the behavior of the kfifo_dma functions to match the usage of the
dma_map_sg function. This means that users must respect the returned
nents value. The sample code is updated to reflect the change.
This bug is trivial to cause: call kfifo_dma_in_prepare() such that it
prepares a scatterlist with a single entry comprising the whole fifo.
This is the case when you map the entirety of a newly created empty fifo.
This causes the setup_sgl() function to mark the first scatterlist entry
as the end of the chain, no matter what comes after it.
Afterwards, add and remove some data from the fifo such that another call
to kfifo_dma_in_prepare() will create two scatterlist entries. It returns
nents=2. However, due to the previous sg_mark_end() call, sg_is_last()
will now return true for the first scatterlist element. This causes the
sample code to print a single scatterlist element when it should print
two.
By removing the call to sg_mark_end(), we make the API as similar as
possible to the DMA mapping API. All users are required to respect the
returned nents.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Cc: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3 years transition phase is enough. Cleanup the last users and remove
the cruft.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Leo Chen <leochen@broadcom.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Chris Zankel <chris@zankel.net>
The current tracing data is not sufficient to deduce the average time
that a callback spends waiting for a grace period to end. Add three
per-CPU counters recording the number of callbacks invoked (ci), the
number of callbacks orphaned (co), and the number of callbacks adopted
(ca). Given the existing callback queue length (ql), the average wait
time in absence of CPU hotplug operations is ql/ci. The units of wait
time will be in terms of the duration over which ci was measured.
In the presence of CPU hotplug operations, there is room for argument,
but ql/(ci-co+ca) won't steer you too far wrong.
Also fixes a typo called out by Lucas De Marchi <lucas.de.marchi@gmail.com>.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The CONFIG_DEBUG_LOCK_ALLOC ifdef isn't necessary at this point, because it is
checked in an outer ifdef level already and has no effect here.
Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The below bug in fork led to the rmap walk finding the parent huge-pmd
twice instead of just once, because the anon_vma_chain objects of the
child vma still point to the vma->vm_mm of the parent.
The patch fixes it by making the rmap walk accurate during fork. It's not
a big deal normally but it worth being accurate considering the cost is
the same.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make use of the jump label infrastructure for tracepoints.
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <a9ba2056e2c9cf332c3c300b577463ce66ff23a8.1284733808.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a jump_label_text_reserved(void *start, void *end), so that other
pieces of code that want to modify kernel text, can first verify that
jump label has not reserved the instruction.
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <06236663a3a7b1c1f13576bb9eccb6d9c17b7bfe.1284733808.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Initialize the workqueue data structures *before* they are registered
so that they are ready for callbacks.
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <e3a3383fc370ac7086625bebe89d9480d7caf372.1284733808.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
base patch to implement 'jump labeling'. Based on a new 'asm goto' inline
assembly gcc mechanism, we can now branch to labels from an 'asm goto'
statment. This allows us to create a 'no-op' fastpath, which can subsequently
be patched with a jump to the slowpath code. This is useful for code which
might be rarely used, but which we'd like to be able to call, if needed.
Tracepoints are the current usecase that these are being implemented for.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <ee8b3595967989fdaf84e698dc7447d315ce972a.1284733808.git.jbaron@redhat.com>
[ cleaned up some formating ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix nohz balance kick
sched: Fix user time incorrectly accounted as system time on 32-bit
Add a tracepoint that shows the priority of a task being boosted
via priority inheritance.
Cc: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a high priority task is waking up on a CPU that is running a
lower priority task that is bound to a CPU, see if we can move the
high RT task to another CPU first. Note, if all other CPUs are
running higher priority tasks than the CPU bounded current task,
then it will be preempted regardless.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.888922071@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move
them. Instead we take a more proactive stance and push or pull RT
tasks from one CPU to another on wakeup or scheduling.
When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT
task, we look to see if we can migrate the RT task that is waking
up, even if the RT task waking up is of higher priority.
This may sound a bit odd, but RT tasks should be limited in
migration by the user anyway. But in practice, people do not do
this, which causes high prio RT tasks to bounce around the CPUs.
This becomes even worse when we have priority inheritance, because
a high prio task can block on a lower prio task and boost its
priority. When the lower prio task wakes up the high prio task, if
it happens to be on the same CPU it will migrate off of it.
But in reality, the above does not happen much either, because the
wake up of the lower prio task, which has already been boosted, if
it was on the same CPU as the higher prio task, it would then
migrate off of it. But anyway, we do not want to migrate them
either.
To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes different
options. The options used in this change log was to have priority
inheritance mutexes or not.
All threads did the following loop:
static void grab_lock(long id, int iter, int l)
{
ftrace_write("thread %ld iter %d, taking lock %d\n",
id, iter, l);
pthread_mutex_lock(&locks[l]);
ftrace_write("thread %ld iter %d, took lock %d\n",
id, iter, l);
busy_loop(nr_tasks - id);
ftrace_write("thread %ld iter %d, unlock lock %d\n",
id, iter, l);
pthread_mutex_unlock(&locks[l]);
}
void *start_task(void *id)
{
[...]
while (!done) {
for (l = 0; l < nr_locks; l++) {
grab_lock(id, i, l);
ftrace_write("thread %ld iter %d sleeping\n",
id, i);
ms_sleep(id);
}
i++;
}
[...]
}
The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
to the ftrace buffer to help analyze via ftrace.
The higher the id, the higher the prio, the shorter it does the
busy loop, but the longer it spins. This is usually the case with
RT tasks, the lower priority tasks usually run longer than higher
priority tasks.
At the end of the test, it records the number of loops each thread
took, as well as the number of voluntary preemptions, non-voluntary
preemptions, and number of migrations each thread took, taking the
information from /proc/$$/sched and /proc/$$/status.
Running this on a 4 CPU processor, the results without changes to
the kernel looked like this:
Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 53 3220 1470 98
1: 562 773 724 98
2: 752 933 1375 98
3: 749 39 697 98
4: 758 5 515 98
5: 764 2 679 99
6: 761 2 535 99
7: 757 3 346 99
total: 5156 4977 6341 787
Each thread regardless of priority migrated a few hundred times.
The higher priority tasks, were a little better but still took
quite an impact.
By letting higher priority tasks bump the lower prio task from the
CPU, things changed a bit:
Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 37 2835 1937 98
1: 666 1821 1865 98
2: 654 1003 1385 98
3: 664 635 973 99
4: 698 197 352 99
5: 703 101 159 99
6: 708 1 75 99
7: 713 1 2 99
total: 4843 6594 6748 789
The total # of migrations did not change (several runs showed the
difference all within the noise). But we now see a dramatic
improvement to the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task to give it the
2 count. This was actually consistent with every run).
Notice that the # of iterations did not change either.
The above was with priority inheritance mutexes. That is, when the
higher prority task blocked on a lower priority task, the lower
priority task would inherit the higher priority task (which shows
why task 6 was bumped so many times). When not using priority
inheritance mutexes, the current kernel shows this:
Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 56 3101 1892 95
1: 594 713 937 95
2: 625 188 618 95
3: 628 4 491 96
4: 640 7 468 96
5: 631 2 501 96
6: 641 1 466 96
7: 643 2 497 96
total: 4458 4018 5870 765
Not much changed with or without priority inheritance mutexes. But
if we let the high priority task bump lower priority tasks on
wakeup we see:
Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 115 3439 2782 98
1: 633 1354 1583 99
2: 652 919 1218 99
3: 645 713 934 99
4: 690 3 3 99
5: 694 1 4 99
6: 720 3 4 99
7: 747 0 1 100
Which shows a even bigger change. The big difference between task 3
and task 4 is because we have only 4 CPUs on the machine, causing
the 4 highest prio tasks to always have preference.
Although I did not measure cache misses, and I'm sure there would
be little to measure since the test was not data intensive, I could
imagine large improvements for higher priority tasks when dealing
with lower priority tasks. Thus, I'm satisfied with making the
change and agreeing with what Gregory Haskins argued a few years
ago when we first had this discussion.
One final note. All tasks in the above tests were RT tasks. Any RT
task will always preempt a non RT task that is running on the CPU
the RT task wants to run on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.605460343@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The per-pmu per-cpu context patch converted things from
get_cpu_var() to this_cpu_ptr(), but that only works if
rcu_read_lock() actually disables preemption, and since
there is no such guarantee, we need to fix that.
Use the newly introduced {get,put}_cpu_ptr().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
LKML-Reference: <20100917093009.308453028@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's a situation where the nohz balancer will try to wake itself:
cpu-x is idle which is also ilb_cpu
got a scheduler tick during idle
and the nohz_kick_needed() in trigger_load_balance() checks for
rq_x->nr_running which might not be zero (because of someone waking a
task on this rq etc) and this leads to the situation of the cpu-x
sending a kick to itself.
And this can cause a lockup.
Avoid this by not marking ourself eligible for kicking.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement flush[_delayed]_work_sync(). These are flush functions
which also make sure no CPU is still executing the target work from
earlier queueing instances. These are similar to
cancel[_delayed]_work_sync() except that the target work item is
flushed instead of cancelled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Factor out start_flush_work() from flush_work(). start_flush_work()
has @wait_executing argument which controls whether the barrier is
queued only if the work is pending or also if executing. As
flush_work() needs to wait for execution too, it uses %true.
This commit doesn't cause any behavior difference. start_flush_work()
will be used to implement flush_work_sync().
Signed-off-by: Tejun Heo <tj@kernel.org>
Make the following cleanup changes.
* Relocate flush/cancel function prototypes and definitions.
* Relocate wait_on_cpu_work() and wait_on_work() before
try_to_grab_pending(). These will be used to implement
flush_work_sync().
* Make all flush/cancel functions return bool instead of int.
* Update wait_on_cpu_work() and wait_on_work() to return %true if they
actually waited.
* Add / update comments.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
queue_lock/unlock/me() and unqueue_me_pi() grab/release spinlocks
but are missing proper annotations. Add them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <1284468228-8723-3-git-send-email-namhyung@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
@uaddr and @uaddr2 fields in restart_block.futex are user
pointers. Add __user and remove unnecessary casts.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <1284468228-8723-2-git-send-email-namhyung@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Sparse complains:
kernel/futex.c:2495:59: warning: incorrect type in argument 3 (different signedness)
Make 3rd argument of fetch_robust_entry() 'unsigned int'.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <1284468228-8723-1-git-send-email-namhyung@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Revert the timer per cpu-context timers because of unfortunate
nohz interaction. Fixing that would have been somewhat ugly, so
go back to driving things from the regular tick. Provide a
jiffies interval feature for people who want slower rotations.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20100917093009.519845633@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the right cpu-context.. spotted by preempt warning on
hot-unplug
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Robert Richter <robert.richter@amd.com>
LKML-Reference: <20100917093009.461794357@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>