The "cpu" argument to rcu_note_context_switch() is always the current
CPU, so drop it. This in turn allows the "cpu" argument to
rcu_preempt_note_context_switch() to be removed, which allows the sole
use of "cpu" in both functions to be replaced with a this_cpu_ptr().
Again, the anticipated cross-CPU uses of these functions has been
replaced by NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Because rcu_preempt_check_callbacks()'s argument is guaranteed to
always be the current CPU, drop the argument and replace per_cpu()
with __this_cpu_read().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Because rcu_pending()'s argument is guaranteed to always be the current
CPU, drop the argument and replace per_cpu_ptr() with this_cpu_ptr().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The "cpu" argument was kept around on the off-chance that RCU might
offload scheduler-clock interrupts. However, this offload approach
has been replaced by NO_HZ_FULL, which offloads -all- RCU processing
from qualifying CPUs. It is therefore time to remove the "cpu" argument
to rcu_check_callbacks(), which this commit does.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The rcu_data per-CPU variable has a number of fields that are atomically
manipulated, potentially by any CPU. This situation can result in false
sharing with per-CPU variables that have the misfortune of being allocated
adjacent to rcu_data in memory. This commit therefore changes the
DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to avoid
this false sharing.
Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
For some functions in kernel/rcu/tree* the rdtp parameter is always
this_cpu_ptr(rdtp). Remove the parameter if constant and calculate the
pointer in function.
This will have the advantage that it is obvious that the address are
all per cpu offsets and thus it will enable the use of this_cpu_ops in
the future.
Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
This patch migrates swsusp_show_speed and its callers to using ktime_t instead
of 'struct timeval' which suffers from the y2038 problem.
Changes to swsusp_show_speed:
- use ktime_t for start and stop times
- pass start and stop times by value
Calling functions affected:
- load_image
- load_image_lzo
- save_image
- save_image_lzo
- hibernate_preallocate_memory
Design decisions:
- use ktime_t to preserve same granularity of reporting as before
- use centisecs logic as before to avoid 'div by zero' issues caused by
using seconds and nanoseconds directly
- use monotonic time (ktime_get()) since we only care about elapsed time.
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Fix a crash on r8a7791/koelsch during resume from system suspend
caused by a recent cpufreq-dt commit (Geert Uytterhoeven).
- Fix an MFD enumeration problem introduced by a recent commit
adding ACPI support to the MFD subsystem that exposed a weakness
in the ACPI core causing ACPI enumeration to be applied to all
devices associated with one ACPI companion object, although it
should be used for one of them only (Mika Westerberg).
- Fix an ACPI EC regression introduced during the 3.17 cycle
causing some Samsung laptops to misbehave as a result of a
workaround targeted at some Acer machines. That includes
a revert of a commit that went too far and a quirk for the
Acer machines in question. From Lv Zheng.
- Fix a regression in the system suspend error code path introduced
during the 3.15 cycle that causes it to fail to take errors from
asychronous execution of "late" suspend callbacks into account
(Imre Deak).
- Fix a long-standing bug in the hibernation resume error code path
that fails to roll back everything correcty on "freeze" callback
errors and leaves some devices in a "suspended" state causing more
breakage to happen subsequently (Imre Deak).
- Make the cpufreq-dt driver disable operation performance points
that are not supported by the VR connected to the CPU voltage
plane with acceptable tolerance instead of constantly failing
voltage scaling later on (Lucas Stach).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUVAuPAAoJEILEb/54YlRxGfQP/0nFfTqyuDN8cPA2qRIzDIoi
8PTOzlhrRuzUlpMkYdsDijxwFcK2/59LomwtuAKHi7309N6UzUa8vAkb8WrzpY7m
XUU+fhsLEkDnEczMfgmbP5ljtP75eJSSWRO0WIBuk4k79qcsutLNtgGpJV7feYSv
+t7OE9DrBPM8lSpBKM/4qs5gnXzdaWmi4xGH7upQWyxAC6RG9GosKdDUZxVxSJQt
oy/y0O4oxwyjg+8EvPwd22JtoFJ6axoEwCJXXlkn7NbIQNGtxrMR9zcMglsuOklg
bG93g1xJl4YCwLXV8sKfPU2kQkQ1ISY3rYIkwIjvBNIY4QFsQpCg3GYt08OJI0bO
4wDD7kH8C51aD9Zfi9luCdE4MsMyGB7SeNvQJul5uMujuG9ZeI61a8d7P6fmXu5X
lk+GeNl/rMujaESwqQlNgm3DvSYfc5FFEDC6F4Wcu4koomSlJwj//lMlOg2ajIgz
p5En6FeC8yGTuobGqo2dT7yYjmxm+kdX+gTStsto+hkxWA7beNjI1iXXWwPrQa/F
7pzneSrdbTZVdzZ1F9eR9AcGljhRMLBxs2XembXgkviCv+IVjw4qHWWKveDQKkhG
CVtcd3jrFSRHeAaqVNnbsoMu2nOLRY2W+f2+FNEfYKc+13aDJYm7pyAOIjujY7ns
Q1jSP7ZZQBVlxP5j5W5x
=g4QU
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are fixes received after my previous pull request plus one that
has been in the works for quite a while, but its previous version
caused problems to happen, so it's been deferred till now.
Fixed are two recent regressions (MFD enumeration and cpufreq-dt),
ACPI EC regression introduced in 3.17, system suspend error code path
regression introduced in 3.15, an older bug related to recovery from
failing resume from hibernation and a cpufreq-dt driver issue related
to operation performance points.
Specifics:
- Fix a crash on r8a7791/koelsch during resume from system suspend
caused by a recent cpufreq-dt commit (Geert Uytterhoeven).
- Fix an MFD enumeration problem introduced by a recent commit adding
ACPI support to the MFD subsystem that exposed a weakness in the
ACPI core causing ACPI enumeration to be applied to all devices
associated with one ACPI companion object, although it should be
used for one of them only (Mika Westerberg).
- Fix an ACPI EC regression introduced during the 3.17 cycle causing
some Samsung laptops to misbehave as a result of a workaround
targeted at some Acer machines. That includes a revert of a commit
that went too far and a quirk for the Acer machines in question.
From Lv Zheng.
- Fix a regression in the system suspend error code path introduced
during the 3.15 cycle that causes it to fail to take errors from
asychronous execution of "late" suspend callbacks into account
(Imre Deak).
- Fix a long-standing bug in the hibernation resume error code path
that fails to roll back everything correcty on "freeze" callback
errors and leaves some devices in a "suspended" state causing more
breakage to happen subsequently (Imre Deak).
- Make the cpufreq-dt driver disable operation performance points
that are not supported by the VR connected to the CPU voltage plane
with acceptable tolerance instead of constantly failing voltage
scaling later on (Lucas Stach)"
* tag 'pm+acpi-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / EC: Fix regression due to conflicting firmware behavior between Samsung and Acer.
Revert "ACPI / EC: Add support to disallow QR_EC to be issued before completing previous QR_EC"
cpufreq: cpufreq-dt: Restore default cpumask_setall(policy->cpus)
PM / Sleep: fix recovery during resuming from hibernation
PM / Sleep: fix async suspend_late/freeze_late error handling
ACPI: Use ACPI companion to match only the first physical device
cpufreq: cpufreq-dt: disable unsupported OPPs
Pull networking fixes from David Miller:
"A bit has accumulated, but it's been a week or so since my last batch
of post-merge-window fixes, so...
1) Missing module license in netfilter reject module, from Pablo.
Lots of people ran into this.
2) Off by one in mac80211 baserate calculation, from Karl Beldan.
3) Fix incorrect return value from ax88179_178a driver's set_mac_addr
op, which broke use of it with bonding. From Ian Morgan.
4) Checking of skb_gso_segment()'s return value was not all
encompassing, it can return an SKB pointer, a pointer error, or
NULL. Fix from Florian Westphal.
This is crummy, and longer term will be fixed to just return error
pointers or a real SKB.
6) Encapsulation offloads not being handled by
skb_gso_transport_seglen(). From Florian Westphal.
7) Fix deadlock in TIPC stack, from Ying Xue.
8) Fix performance regression from using rhashtable for netlink
sockets. The problem was the synchronize_net() invoked for every
socket destroy. From Thomas Graf.
9) Fix bug in eBPF verifier, and remove the strong dependency of BPF
on NET. From Alexei Starovoitov.
10) In qdisc_create(), use the correct interface to allocate
->cpu_bstats, otherwise the u64_stats_sync member isn't
initialized properly. From Sabrina Dubroca.
11) Off by one in ip_set_nfnl_get_byindex(), from Dan Carpenter.
12) nf_tables_newchain() was erroneously expecting error pointers from
netdev_alloc_pcpu_stats(). It only returna a valid pointer or
NULL. From Sabrina Dubroca.
13) Fix use-after-free in _decode_session6(), from Li RongQing.
14) When we set the TX flow hash on a socket, we mistakenly do so
before we've nailed down the final source port. Move the setting
deeper to fix this. From Sathya Perla.
15) NAPI budget accounting in amd-xgbe driver was counting descriptors
instead of full packets, fix from Thomas Lendacky.
16) Fix total_data_buflen calculation in hyperv driver, from Haiyang
Zhang.
17) Fix bcma driver build with OF_ADDRESS disabled, from Hauke
Mehrtens.
18) Fix mis-use of per-cpu memory in TCP md5 code. The problem is
that something that ends up being vmalloc memory can't be passed
to the crypto hash routines via scatter-gather lists. From Eric
Dumazet.
19) Fix regression in promiscuous mode enabling in cdc-ether, from
Olivier Blin.
20) Bucket eviction and frag entry killing can race with eachother,
causing an unlink of the object from the wrong list. Fix from
Nikolay Aleksandrov.
21) Missing initialization of spinlock in cxgb4 driver, from Anish
Bhatt.
22) Do not cache ipv4 routing failures, otherwise if the sysctl for
forwarding is subsequently enabled this won't be seen. From
Nicolas Cavallari"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (131 commits)
drivers: net: cpsw: Support ALLMULTI and fix IFF_PROMISC in switch mode
drivers: net: cpsw: Fix broken loop condition in switch mode
net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0
stmmac: pci: set default of the filter bins
net: smc91x: Fix gpios for device tree based booting
mpls: Allow mpls_gso to be built as module
mpls: Fix mpls_gso handler.
r8152: stop submitting intr for -EPROTO
netfilter: nft_reject_bridge: restrict reject to prerouting and input
netfilter: nft_reject_bridge: don't use IP stack to reject traffic
netfilter: nf_reject_ipv6: split nf_send_reset6() in smaller functions
netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functions
netfilter: nf_tables_bridge: update hook_mask to allow {pre,post}routing
drivers/net: macvtap and tun depend on INET
drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets
drivers/net: Disable UFO through virtio
net: skb_fclone_busy() needs to detect orphaned skb
gre: Use inner mac length when computing tunnel length
mlx4: Avoid leaking steering rules on flow creation error flow
net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
...
Pull scheduler fixes from Ingo Molnar:
"Various scheduler fixes all over the place: three SCHED_DL fixes,
three sched/numa fixes, two generic race fixes and a comment fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/dl: Fix preemption checks
sched: Update comments for CLONE_NEWNS
sched: stop the unbound recursion in preempt_schedule_context()
sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
sched/fair: Care divide error in update_task_scan_period()
sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
sched: Fix race between task_group and sched_task_group
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, plus on the kernel side:
- a revert for a newly introduced PMU driver which isn't complete yet
and where we ran out of time with fixes (to be tried again in
v3.19) - this makes up for a large chunk of the diffstat.
- compilation warning fixes
- a printk message fix
- event_idx usage fixes/cleanups"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf probe: Trivial typo fix for --demangle
perf tools: Fix report -F dso_from for data without branch info
perf tools: Fix report -F dso_to for data without branch info
perf tools: Fix report -F symbol_from for data without branch info
perf tools: Fix report -F symbol_to for data without branch info
perf tools: Fix report -F mispredict for data without branch info
perf tools: Fix report -F in_tx for data without branch info
perf tools: Fix report -F abort for data without branch info
perf tools: Make CPUINFO_PROC an array to support different kernel versions
perf callchain: Use global caching provided by libunwind
perf/x86/intel: Revert incomplete and undocumented Broadwell client support
perf/x86: Fix compile warnings for intel_uncore
perf: Fix typos in sample code in the perf_event.h header
perf: Fix and clean up initialization of pmu::event_idx
perf: Fix bogus kernel printk
perf diff: Add missing hists__init() call at tool start
Pull futex fixes from Ingo Molnar:
"This contains two futex fixes: one fixes a race condition, the other
clarifies shared/private futex comments"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Fix a race condition between REQUEUE_PI and task death
futex: Mention key referencing differences between shared and private futexes
Pull core fixes from Ingo Molnar:
"The tree contains two RCU fixes and a compiler quirk comment fix"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Make rcu_barrier() understand about missing rcuo kthreads
compiler/gcc4+: Remove inaccurate comment about 'asm goto' miscompiles
rcu: More on deadlock between CPU hotplug and expedited grace periods
Pull timer fixes from Thomas Gleixner:
"As you requested in the rc2 release mail the timer department serves
you a few real bug fixes:
- Fix the probe logic of the architected arm/arm64 timer
- Plug a stack info leak in posix-timers
- Prevent a shift out of bounds issue in the clockevents core"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ARM/ARM64: arch-timer: fix arch_timer_probed logic
clockevents: Prevent shift out of bounds
posix-timers: Fix stack info leak in timer_create()
tracing system does not support that and without checks, it can cause
an oops to be reported.
Rabin Vincent added checks in the return code on syscall events to make
sure that the system call number is within the range that tracing
knows about, and if not, simply ignores the system call.
The system call tracing infrastructure needs to be rewritten to handle these
cases better, but for now, to keep from oopsing, this patch will do.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUUt+4AAoJEEjnJuOKh9ld3HgH/0RL7neY1tp05+v0GRvABmGr
6T47GEmZi9NiQOWjFC4SxNHLQSjpQX7eLD2CC6bljDfFpgKiIqarWHegEBUoBF9K
Dlg2jPpCwwwKbTXlAKTmv9QTGzvBEYyVZxhSC7mEbziV4Rbt7CVZJlogVdeYP5y0
4mWyHJg11Dt9SiZJCIv8sIrx2Xka2eX+Aq30dwYd9JGco3vVCH8NZ09ZgYBHaxIm
YrL6yUVnHP3nqKiEL4qCMUqUzexzdwUhrGPddLANaSRTWT+EAGYPD113bA76jAKc
cd3eaFwFkmCA0yfmjjBSb23FsPvKHc7j6BtZA6Q3uKPZUVlX+DyVNisUfEnaLQs=
=9NTR
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"ARM has system calls outside the NR_syscalls range, and the generic
tracing system does not support that and without checks, it can cause
an oops to be reported.
Rabin Vincent added checks in the return code on syscall events to
make sure that the system call number is within the range that tracing
knows about, and if not, simply ignores the system call.
The system call tracing infrastructure needs to be rewritten to handle
these cases better, but for now, to keep from oopsing, this patch will
do"
* tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/syscalls: Ignore numbers outside NR_syscalls' range
The file /sys/kernel/debug/tracing/eneabled_functions is used to debug
ftrace function hooks. Add to the output what function is being called
by the trampoline if the arch supports it.
Add support for this feature in x86_64.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current method of handling multiple function callbacks is to register
a list function callback that calls all the other callbacks based on
their hash tables and compare it to the function that the callback was
called on. But this is very inefficient.
For example, if you are tracing all functions in the kernel and then
add a kprobe to a function such that the kprobe uses ftrace, the
mcount trampoline will switch from calling the function trace callback
to calling the list callback that will iterate over all registered
ftrace_ops (in this case, the function tracer and the kprobes callback).
That means for every function being traced it checks the hash of the
ftrace_ops for function tracing and kprobes, even though the kprobes
is only set at a single function. The kprobes ftrace_ops is checked
for every function being traced!
Instead of calling the list function for functions that are only being
traced by a single callback, we can call a dynamically allocated
trampoline that calls the callback directly. The function graph tracer
already uses a direct call trampoline when it is being traced by itself
but it is not dynamically allocated. It's trampoline is static in the
kernel core. The infrastructure that called the function graph trampoline
can also be used to call a dynamically allocated one.
For now, only ftrace_ops that are not dynamically allocated can have
a trampoline. That is, users such as function tracer or stack tracer.
kprobes and perf allocate their ftrace_ops, and until there's a safe
way to free the trampoline, it can not be used. The dynamically allocated
ftrace_ops may, although, use the trampoline if the kernel is not
compiled with CONFIG_PREEMPT. But that will come later.
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.
# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...
# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184
Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for greater than NR_syscalls.
Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
Fixes: cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a space between subj= and feature= fields to make them parsable.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
verifier keeps track of register state spilled to stack.
registers are 8-byte wide and always aligned, so instead of tracking them
in every byte-sized stack slot, use MAX_BPF_STACK / 8 array to track
spilled register state.
Though verifier runs in user context and its state freed immediately
after verification, it makes sense to reduce its memory usage.
This optimization reduces sizeof(struct verifier_state)
from 12464 to 1712 on 64-bit and from 6232 to 1112 on 32-bit.
Note, this patch doesn't change existing limits, which are there to bound
time and memory during verification: 4k total number of insns in a program,
1k number of jumps (states to visit) and 32k number of processed insn
(since an insn may be visited multiple times). Theoretical worst case memory
during verification is 1712 * 1k = 17Mbyte. Out-of-memory situation triggers
cleanup and rejects the program.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull two RCU fixes from Paul E. McKenney:
" - Complete the work of commit dd56af42bd (rcu: Eliminate deadlock
between CPU hotplug and expedited grace periods), which was
intended to allow synchronize_sched_expedited() to be safely
used when holding locks acquired by CPU-hotplug notifiers.
This commit makes the put_online_cpus() avoid the deadlock
instead of just handling the get_online_cpus().
- Complete the work of commit 35ce7f29a4 (rcu: Create rcuo
kthreads only for onlined CPUs), which was intended to allow
RCU to avoid allocating unneeded kthreads on systems where the
firmware says that there are more CPUs than are really present.
This commit makes rcu_barrier() aware of the mismatch, so that
it doesn't hang waiting for non-existent CPUs. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Found this in the message log on a s390 system:
BUG kmalloc-192 (Not tainted): Poison overwritten
Disabling lock debugging due to kernel taint
INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
__slab_alloc.isra.47.constprop.56+0x5f6/0x658
kmem_cache_alloc_trace+0x106/0x408
call_usermodehelper_setup+0x70/0x128
call_usermodehelper+0x62/0x90
cgroup_release_agent+0x178/0x1c0
process_one_work+0x36e/0x680
worker_thread+0x2f0/0x4f8
kthread+0x10a/0x120
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
__slab_free+0x94/0x560
kfree+0x364/0x3e0
call_usermodehelper_exec+0x110/0x1b8
cgroup_release_agent+0x178/0x1c0
process_one_work+0x36e/0x680
worker_thread+0x2f0/0x4f8
kthread+0x10a/0x120
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
There is a use-after-free bug on the subprocess_info structure allocated
by the user mode helper. In case do_execve() returns with an error
____call_usermodehelper() stores the error code to sub_info->retval, but
sub_info can already have been freed.
Regarding UMH_NO_WAIT, the sub_info structure can be freed by
__call_usermodehelper() before the worker thread returns from
do_execve(), allowing memory corruption when do_execve() failed after
exec_mmap() is called.
Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
call_usermodehelper_exec() to continue which then frees sub_info.
To fix this race the code needs to make sure that the call to
call_usermodehelper_freeinfo() is always done after the last store to
sub_info->retval.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following up the arm testing of gcov, turns out gcov on ARM64 works fine
as well. Only change needed is adding ARM64 to Kconfig depends.
Tested with qemu and mach-virt
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
contains checks for the case where CPUs are brought online out of
order, re-wiring the rcuo leader-follower relationships as needed.
Unfortunately, this rewiring was broken. This apparently went undetected
due to the tendency of systems to bring CPUs online in order. This commit
nevertheless fixes the rewiring.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads. This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
PREEMPT_RCU and TREE_PREEMPT_RCU serve the same function after
TINY_PREEMPT_RCU has been removed. This patch removes TREE_PREEMPT_RCU
and uses PREEMPT_RCU config option in its place.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Rename CONFIG_RCU_BOOST_PRIO to CONFIG_RCU_KTHREAD_PRIO and use this
value for both the per-CPU kthreads (rcuc/N) and the rcu boosting
threads (rcub/n).
Also, create the module_parameter rcutree.kthread_prio to be used on
the kernel command line at boot to set a new value (rcutree.kthread_prio=N).
Signed-off-by: Clark Williams <clark.williams@gmail.com>
[ paulmck: Ported to rcu/dev, applied Paul Bolle and Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
__cleanup_sighand() frees sighand without RCU grace period. This is
correct but this looks "obviously buggy" and constantly confuses the
readers, add the comments to explain how this works.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The kill_pid_info() can potentially loop indefinitely if tasks are created
and deleted sufficiently quickly, and if this happens, this function
will remain in a single RCU read-side critical section indefinitely.
This commit therefore exits the RCU read-side critical section on each
pass through the loop. Because a race must happen to retry the loop,
this should have no performance impact in the common case.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
ktime_get_real_seconds() is the replacement function for get_seconds()
returning the seconds portion of CLOCK_REALTIME in a time64_t. For
64bit the function is equivivalent to get_seconds(), but for 32bit it
protects the readout with the timekeeper sequence count. This is
required because 32-bit machines cannot access 64-bit tk->xtime_sec
variable atomically.
[tglx: Massaged changelog and added docbook comment ]
Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/7adcfaa8962b8ad58785d9a2456c3f77d93c0ffb.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is the counterpart to get_seconds() based on CLOCK_MONOTONIC. The
use case for this interface are kernel internal coarse grained
timestamps which do neither require the nanoseconds fraction of
current time nor the CLOCK_REALTIME properties. Such timestamps can
currently only retrieved by calling ktime_get_ts64() and using the
tv_sec field of the returned timespec64. That's inefficient as it
involves the read of the clocksource, math operations and must be
protected by the timekeeper sequence counter.
To avoid the sequence counter protection we restrict the return value
to unsigned 32bit on 32bit machines. This covers ~136 years of uptime
and therefor an overflow is not expected to hit anytime soon.
To avoid math in the function we calculate the current seconds portion
of CLOCK_MONOTONIC when the timekeeper gets updated in
tk_update_ktime_data() similar to the CLOCK_REALTIME counterpart
xtime_sec.
[ tglx: Massaged changelog, simplified and commented the update
function, added docbook comment ]
Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/da0b63f4bdf3478909f92becb35861197da3a905.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently, synchronize_sched_expedited() sends IPIs to all online CPUs,
even those that are idle or executing in nohz_full= userspace. Because
idle CPUs and nohz_full= userspace CPUs are in extended quiescent states,
there is no need to IPI them in the first place. This commit therefore
avoids IPIing CPUs that are already in extended quiescent states.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There are some RCU_BOOST-specific per-CPU variable declarations that
are needlessly defined under #ifdef in kernel/rcu/tree.c. This commit
therefore moves these declarations into a pre-existing #ifdef in
kernel/rcu/tree_plugin.h.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The CONFIG_RCU_CPU_STALL_VERBOSE Kconfig parameter causes preemptible
RCU's CPU stall warnings to dump out any preempted tasks that are blocking
the current RCU grace period. This information is useful, and the default
has been CONFIG_RCU_CPU_STALL_VERBOSE=y for some years. It is therefore
time for this commit to remove this Kconfig parameter, so that future
kernel builds will always act as if CONFIG_RCU_CPU_STALL_VERBOSE=y.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
the accounting of the ftrace_ops trampoline logic. One was that the
old hash was not updated before calling the modify code for an ftrace_ops.
The second bug was what let the first bug go unnoticed, as the update would
check the current hash for all ftrace_ops (where it should only check the
old hash for modified ones). This let things work when only one ftrace_ops
was registered to a function, but could break if more than one was
registered depending on the order of the look ups.
The worse thing that can happen if this bug triggers is that the ftrace
self checks would find an anomaly and shut itself down.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUToYWAAoJEEjnJuOKh9ldfS8H/36CL5E+4itux9tIhf13Untj
FSi3EzvEdrTYu7IhdyRB6N7cp07g79jU3v40ZDLxDHzG2i4VLft/Z3uzIC0Z6mhL
kJZCCWpUTAKJO/UPFcenEZ7eiL+B+5QVOc1Oxcet0odG5HWkEZG62va/MrhB9k/7
uUNRqXNjg7w2rG0TK2qjcTHiPGJ9h7/wG9RgYktAIs27BUmip5sRS1IMyFL51Gpo
UNtIKGtG6/4hizdlHhWBuAa6ErM37GPskx3iP/45xiAu3J8SIbOk1FBe+4Xk+DZQ
hZK479hzlk6OU/M2vDJefG1d6zeQ7y00LMkUIAPiUEgayXAXpYX7UjV13CLQeGU=
=HrhJ
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace trampoline accounting fixes from Steven Rostedt:
"Adding the new code for 3.19, I discovered a couple of minor bugs with
the accounting of the ftrace_ops trampoline logic.
One was that the old hash was not updated before calling the modify
code for an ftrace_ops. The second bug was what let the first bug go
unnoticed, as the update would check the current hash for all
ftrace_ops (where it should only check the old hash for modified
ones). This let things work when only one ftrace_ops was registered
to a function, but could break if more than one was registered
depending on the order of the look ups.
The worse thing that can happen if this bug triggers is that the
ftrace self checks would find an anomaly and shut itself down"
* tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix checking of trampoline ftrace_ops in finding trampoline
ftrace: Set ops->old_hash on modifying what an ops hooks to
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online. This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a4, this could result in huge numbers of useless
rcuo kthreads being created.
It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix. The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.
It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread. This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks. It is therefore required to wait even for those callbacks
that cannot possibly be invoked. Even if doing so hangs the system.
Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), It is tempting to report an error
in this case. Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().
So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.
Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>
cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.
In particular, preemption preserves task_struct::state.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Validate we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, eg. mutex usage in a wait loop.
Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.
Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a genuine bug in add_unformed_module(), we cannot use blocking
primitives inside a wait loop.
So rewrite the wait_event_interruptible() usage to use the fresh
wait_woken() stuff.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20140924082242.458562904@infradead.org
[ So this is probably complex to backport and the race wasn't reported AFAIK,
so not marked for -stable. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
smp_hotplug_thread::{setup,unpark} functions can sleep too, so be
consistent and do the same for all callbacks.
__might_sleep+0x74/0x80
kmem_cache_alloc_trace+0x4e/0x1c0
perf_event_alloc+0x55/0x450
perf_event_create_kernel_counter+0x2f/0x100
watchdog_nmi_enable+0x8d/0x160
watchdog_enable+0x45/0x90
smpboot_thread_fn+0xec/0x2b0
kthread+0xe4/0x100
ret_from_fork+0x7c/0xb0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.392279328@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end
up calling potential sleeps before we reset it.
Not strictly a bug since we're guaranteed to exit the loop and not
call schedule(); put in annotations to quiet might_sleep().
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270
Call Trace:
[<ffffffff81694991>] dump_stack+0x4e/0x7a
[<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff810bca6e>] __might_sleep+0x7e/0x90
[<ffffffff811a1c15>] might_fault+0x55/0xb0
[<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10
[<ffffffff8109a804>] do_wait+0x104/0x270
[<ffffffff8109b837>] SyS_wait4+0x77/0x100
[<ffffffff8169d692>] system_call_fastpath+0x16/0x1b
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: umgwanakikbuti@gmail.com
Cc: ilya.dryomov@inktank.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a few places that call blocking primitives from wait loops,
provide infrastructure to support this without the typical
task_struct::state collision.
We record the wakeup in wait_queue_t::flags which leaves
task_struct::state free to be used by others.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We're going to make might_sleep() test for TASK_RUNNING, because
blocking without TASK_RUNNING will destroy the task state by setting
it to TASK_RUNNING.
There are a few occasions where its 'valid' to call blocking
primitives (and mutex_lock in particular) and not have TASK_RUNNING,
typically such cases are right before we set TASK_RUNNING anyhow.
Robustify the code by not assuming this; this has the beneficial side
effect of allowing optional code emission for fixing the above
might_sleep() false positives.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy reported that the current state of event_idx is rather confused.
So remove all but the x86_pmu implementation and change the default to
return 0 (the safe option).
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Cody P Schafer <dev@codyps.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Himangi Saraogi <himangi774@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sukadev@linux.vnet.ibm.com <sukadev@linux.vnet.ibm.com>
Cc: Thomas Huth <thuth@linux.vnet.ibm.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux390@de.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use nr_cpus_allowed to bail from select_task_rq() when only one cpu
can be used, and saves some cycles for pinned tasks.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to do balance during fork since SCHED_DEADLINE
tasks can't fork. This patch avoid the SD_BALANCE_FORK check.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng.li@linux.intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.
This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:
- No check is performed when the user tries to attach a task to
an exlusive cpuset (recall that exclusive cpusets have an
associated maximum allowed bandwidth).
- Bandwidths of source and destination cpusets are not correctly
updated after a task is migrated between them.
This patch fixes both things at once, as they are opposite faces
of the same coin.
The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118):
| If rq has already had 2 or more pushable tasks and we try to add a
| pinned task then call of push_rt_task will just waste a time.
Just switched pinned task is not able to be pushed. If the rq has had
several dl tasks before they have already been considered as candidates
to be pushed (or pulled). This patch implements the same behavior as rt
class which introduced by commit 1044791755 ("sched/rt: Do not try to
push tasks if pinned task switches to RT").
Suggested-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_preempt_count() is pointless if preemption counter is per-cpu,
currently this is x86 only. It is only valid if the task is not
running, and even in this case the only info it can provide is the
state of PREEMPT_ACTIVE bit.
Change its single caller to check p->on_rq instead, this should be
the same if p->state != TASK_RUNNING, and kill this helper.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/20141008183348.GC17495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both callers of finish_task_switch() need to recalculate this_rq()
and pass it as an argument, plus __schedule() does this again after
context_switch().
It would be simpler to call this_rq() once in finish_task_switch()
and return the this rq to the callers.
Note: probably "int cpu" in __schedule() should die; it is not used
and both rcu_note_context_switch() and wq_worker_sleeping() do not
really need this argument.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009193232.GB5408@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
finish_task_switch() enables preemption, so post_schedule(rq) can be
called on the wrong (and even dead) CPU. Afaics, nothing really bad
can happen, but in this case we can wrongly clear rq->post_schedule
on that CPU. And this simply looks wrong in any case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141008193644.GA32055@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In pseudo-interleaved numa_groups, all tasks try to relocate to
the group's preferred_nid. When a group is spread across multiple
NUMA nodes, this can lead to tasks swapping their location with
other tasks inside the same group, instead of swapping location with
tasks from other NUMA groups. This can keep NUMA groups from converging.
Examining all nodes, when dealing with a task in a pseudo-interleaved
NUMA group, avoids this problem. Note that only CPUs in nodes that
improve the task or group score are examined, so the loop isn't too
bad.
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Vinod Chegu" <chegu_vinod@hp.com>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On systems with complex NUMA topologies, the node scoring is adjusted
to allow workloads to converge on nodes that are near each other.
The way a task group's preferred nid is determined needs to be adjusted,
in order for the preferred_nid to be consistent with group_weight scoring.
This ensures that we actually try to converge workloads on adjacent nodes.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.
In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.
Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Those nodes are skipped for efficiency reasons,
there is no real policy reason to do so.
Placement policy on directly connected NUMA systems is not affected.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight()
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Smaller NUMA systems tend to have all NUMA nodes directly connected
to each other. This includes the degenerate case of a system with just
one node, ie. a non-NUMA system.
Larger systems can have two kinds of NUMA topology, which affects how
tasks and memory should be placed on the system.
On glueless mesh systems, nodes that are not directly connected to
each other will bounce traffic through intermediary nodes. Task groups
can be run closer to each other by moving tasks from a node to an
intermediary node between it and the task's preferred node.
On NUMA systems with backplane controllers, the intermediary hops
are incapable of running programs. This creates "islands" of nodes
that are at an equal distance to anywhere else in the system.
Each kind of topology requires a slightly different placement
algorithm; this patch provides the mechanism to detect the kind
of NUMA topology of a system.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
[ Changed to use kernel/sched/sched.h ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export some information that is necessary to do placement of
tasks on systems with multi-level NUMA topologies.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1) switched_to_dl() check is wrong. We reschedule only
if rq->curr is deadline task, and we do not reschedule
if it's a lower priority task. But we must always
preempt a task of other classes.
2) dl_task_timer():
Policy does not change in case of priority inheritance.
rt_mutex_setprio() changes prio, while policy remains old.
So we lose some balancing logic in dl_task_timer() and
switched_to_dl() when we check policy instead of priority. Boosted
task may be rq->curr.
(I didn't change switched_from_dl() because no check is necessary
there at all).
I've looked at this place(switched_to_dl) several times and even fixed
this function, but found just now... I suppose some performance tests
may work better after this.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
preempt_schedule_context() does preempt_enable_notrace() at the end
and this can call the same function again; exception_exit() is heavy
and it is quite possible that need-resched is true again.
1. Change this code to dec preempt_count() and check need_resched()
by hand.
2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
the enable/disable dance around __schedule(). But in this case
we need to move into sched/core.c.
3. Cosmetic, but x86 forgets to declare this function. This doesn't
really matter because it is only called by asm helpers, still it
make sense to add the declaration into asm/preempt.h to match
preempt_schedule().
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While offling node by hot removing memory, the following divide error
occurs:
divide error: 0000 [#1] SMP
[...]
Call Trace:
[...] handle_mm_fault
[...] ? try_to_wake_up
[...] ? wake_up_state
[...] __do_page_fault
[...] ? do_futex
[...] ? put_prev_entity
[...] ? __switch_to
[...] do_page_fault
[...] page_fault
[...]
RIP [<ffffffff810a7081>] task_numa_fault
RSP <ffff88084eb2bcb0>
The issue occurs as follows:
1. When page fault occurs and page is allocated from node 1,
task_struct->numa_faults_buffer_memory[] of node 1 is
incremented and p->numa_faults_locality[] is also incremented
as follows:
o numa_faults_buffer_memory[] o numa_faults_locality[]
NR_NUMA_HINT_FAULT_TYPES
| 0 | 1 |
---------------------------------- ----------------------
node 0 | 0 | 0 | remote | 0 |
node 1 | 0 | 1 | locale | 1 |
---------------------------------- ----------------------
2. node 1 is offlined by hot removing memory.
3. When page fault occurs, fault_types[] is calculated by using
p->numa_faults_buffer_memory[] of all online nodes in
task_numa_placement(). But node 1 was offline by step 2. So
the fault_types[] is calculated by using only
p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
are set to 0.
4. The values(0) of fault_types[] pass to update_task_scan_period().
5. numa_faults_locality[1] is set to 1. So the following division is
calculated.
static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private){
...
ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
}
6. But both of private and shared are set to 0. So divide error
occurs here.
The divide error is rare case because the trigger is node offline.
This patch always increments denominator for avoiding divide error.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If curr task is exiting this may be a reason of use-after-free:
task_numa_compare() do_exit()
... current->flags |= PF_EXITING;
... release_task()
... ~~delayed_put_task_struct()~~
... schedule()
rcu_read_lock() ...
cur = ACCESS_ONCE(dst_rq->curr) ...
... rq->curr = next;
... context_switch()
... finish_task_switch()
... put_task_struct()
... __put_task_struct()
... free_task_struct()
task_numa_assign() ...
get_task_struct() ...
As noted by Oleg:
<<The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp>>
The patch provides simple check of PF_EXITING flag. If it's not set,
this guarantees that call_rcu() of delayed_put_task_struct() callback
hasn't happened yet, so we can safely do get_task_struct() in
task_numa_assign().
Locked dst_rq->lock protects from concurrency with the last schedule().
Reusing or unmapping of cur's memory may happen without it.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
dl_task_timer() is racy against several paths. Daniel noticed that
the replenishment timer may experience a race condition against an
enqueue_dl_entity() called from rt_mutex_setprio(). With his own
words:
rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
start_dl_timer() throttled = 1, rt_mutex_setprio() throlled = 0,
sched_switch() -> enqueue_task(), dl_task_timer-> enqueue_task()
throttled is 0
=> BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
enqueued on the -deadline runqueue.
As we do for the other races, we just bail out in the replenishment
timer code.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: vincent@legout.info
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the deboost path, right after the dl_boosted flag has been
reset, we can currently end up replenishing using -deadline
parameters of a !SCHED_DEADLINE entity. This of course causes
a bug, as those parameters are empty.
In the case depicted above it is safe to simply bail out, as
the deboosted task is going to be back to its original scheduling
class anyway.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: vincent@legout.info
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The race may happen when somebody is changing task_group of a forking task.
Child's cgroup is the same as parent's after dup_task_struct() (there just
memory copying). Also, cfs_rq and rt_rq are the same as parent's.
But if parent changes its task_group before it's called cgroup_post_fork(),
we do not reflect this situation on child. Child's cfs_rq and rt_rq remain
the same, while child's task_group changes in cgroup_post_fork().
To fix this we introduce fork() method, which calls sched_move_task() directly.
This function changes sched_task_group on appropriate (also its logic has
no problem with freshly created tasks, so we shouldn't introduce something
special; we are able just to use it).
Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
introduce two configs:
- hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
depend on
- visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
that solves several problems:
- tracing and others that wish to use eBPF don't need to depend on NET.
They can use BPF_SYSCALL to allow loading from userspace or select BPF
to use it directly from kernel in NET-less configs.
- in 3.18 programs cannot be attached to events yet, so don't force it on
- when the rest of eBPF infra is there in 3.19+, it's still useful to
switch it off to minimize kernel size
bloat-o-meter on x64 shows:
add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
tested with many different config combinations. Hopefully didn't miss anything.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a device's dev_pm_ops::freeze callback fails during the QUIESCE
phase, we don't rollback things correctly calling the thaw and complete
callbacks. This could leave some devices in a suspended state in case of
an error during resuming from hibernation.
Signed-off-by: Imre Deak <imre.deak@intel.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
free_pi_state and exit_pi_state_list both clean up futex_pi_state's.
exit_pi_state_list takes the hb lock first, and most callers of
free_pi_state do too. requeue_pi doesn't, which means free_pi_state
can free the pi_state out from under exit_pi_state_list. For example:
task A | task B
exit_pi_state_list |
pi_state = |
curr->pi_state_list->next |
| futex_requeue(requeue_pi=1)
| // pi_state is the same as
| // the one in task A
| free_pi_state(pi_state)
| list_del_init(&pi_state->list)
| kfree(pi_state)
list_del_init(&pi_state->list) |
Move the free_pi_state calls in requeue_pi to before it drops the hb
locks which it's already holding.
[ tglx: Removed a pointless free_pi_state() call and the hb->lock held
debugging. The latter comes via a seperate patch ]
Signed-off-by: Brian Silverman <bsilver16384@gmail.com>
Cc: austin.linux@gmail.com
Cc: darren@dvhart.com
Cc: peterz@infradead.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Update our documentation as of fix 76835b0ebf (futex: Ensure
get_futex_key_refs() always implies a barrier). Explicitly
state that we don't do key referencing for private futexes.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matteo Franchin <Matteo.Franchin@arm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: http://lkml.kernel.org/r/1414121220.817.0.camel@linux-t7sj.site
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andrey reported that on a kernel with UBSan enabled he found:
UBSan: Undefined behaviour in ../kernel/time/clockevents.c:75:34
I guess it should be 1ULL here instead of 1U:
(!ismax || evt->mult <= (1U << evt->shift)))
That's indeed the correct solution because shift might be 32.
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If userland creates a timer without specifying a sigevent info, we'll
create one ourself, using a stack local variable. Particularly will we
use the timer ID as sival_int. But as sigev_value is a union containing
a pointer and an int, that assignment will only partially initialize
sigev_value on systems where the size of a pointer is bigger than the
size of an int. On such systems we'll copy the uninitialized stack bytes
from the timer_create() call to userland when the timer actually fires
and we're going to deliver the signal.
Initialize sigev_value with 0 to plug the stack info leak.
Found in the PaX patch, written by the PaX Team.
Fixes: 5a9fa73072 ("posix-timers: kill ->it_sigev_signo and...")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: PaX Team <pageexec@freemail.hu>
Cc: <stable@vger.kernel.org> # v2.6.28+
Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When modifying code, ftrace has several checks to make sure things
are being done correctly. One of them is to make sure any code it
modifies is exactly what it expects it to be before it modifies it.
In order to do so with the new trampoline logic, it must be able
to find out what trampoline a function is hooked to in order to
see if the code that hooks to it is what's expected.
The logic to find the trampoline from a record (accounting descriptor
for a function that is hooked) needs to only look at the "old_hash"
of an ops that is being modified. The old_hash is the list of function
an ops is hooked to before its update. Since a record would only be
pointing to an ops that is being modified if it was already hooked
before.
Currently, it can pick a modified ops based on its new functions it
will be hooked to, and this picks the wrong trampoline and causes
the check to fail, disabling ftrace.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace: squash into ordering of ops for modification
The code that checks for trampolines when modifying function hooks
tests against a modified ops "old_hash". But the ops old_hash pointer
is not being updated before the changes are made, making it possible
to not find the right hash to the callback and possibly causing
ftrace to break in accounting and disable itself.
Have the ops set its old_hash before the modifying takes place.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit dd56af42bd (rcu: Eliminate deadlock between CPU hotplug and
expedited grace periods) was incomplete. Although it did eliminate
deadlocks involving synchronize_sched_expedited()'s acquisition of
cpu_hotplug.lock via get_online_cpus(), it did nothing about the similar
deadlock involving acquisition of this same lock via put_online_cpus().
This deadlock became apparent with testing involving hibernation.
This commit therefore changes put_online_cpus() acquisition of this lock
to be conditional, and increments a new cpu_hotplug.puts_pending field
in case of acquisition failure. Then cpu_hotplug_begin() checks for this
new field being non-zero, and applies any changes to cpu_hotplug.refcount.
Reported-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Tested-by: Borislav Petkov <bp@suse.de>
Clean up the code in process.c after recent changes to get rid of
unnecessary labels and goto statements.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
while comparing for verifier state equivalency the comparison
was missing a check for uninitialized register.
Make sure it does so and add a testcase.
Fixes: f1bca824da ("bpf: add search pruning optimization to verifier")
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
as per 0c740d0afc (introduce for_each_thread() to replace the buggy
while_each_thread()) get rid of do_each_thread { } while_each_thread()
construct and replace it by a more error prone for_each_thread.
This patch doesn't introduce any user visible change.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.
Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.
Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.
Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per Rafael
Fixes: f660daac47 (oom: thaw threads if oom killed thread is frozen before deferring)
Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
__thaw_task() no longer clears frozen flag since commit a3201227f8
(freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE).
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since f660daac47 (oom: thaw threads if oom killed thread is frozen
before deferring) OOM killer relies on being able to thaw a frozen task
to handle OOM situation but a3201227f8 (freezer: make freezing() test
freeze conditions in effect instead of TIF_FREEZE) has reorganized the
code and stopped clearing freeze flag in __thaw_task. This means that
the target task only wakes up and goes into the fridge again because the
freezing condition hasn't changed for it. This reintroduces the bug
fixed by f660daac47.
Fix the issue by checking for TIF_MEMDIE thread flag in
freezing_slow_path and exclude the task from freezing completely. If a
task was already frozen it would get woken by __thaw_task from OOM killer
and get out of freezer after rechecking freezing().
Changes since v1
- put TIF_MEMDIE check into freezing_slowpath rather than in __refrigerator
as per Oleg
- return __thaw_task into oom_scan_process_thread because
oom_kill_process will not wake task in the fridge because it is
sleeping uninterruptible
[mhocko@suse.cz: rewrote the changelog]
Fixes: a3201227f8 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE)
Cc: 3.3+ <stable@vger.kernel.org> # 3.3+
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull audit updates from Eric Paris:
"So this change across a whole bunch of arches really solves one basic
problem. We want to audit when seccomp is killing a process. seccomp
hooks in before the audit syscall entry code. audit_syscall_entry
took as an argument the arch of the given syscall. Since the arch is
part of what makes a syscall number meaningful it's an important part
of the record, but it isn't available when seccomp shoots the
syscall...
For most arch's we have a better way to get the arch (syscall_get_arch)
So the solution was two fold: Implement syscall_get_arch() everywhere
there is audit which didn't have it. Use syscall_get_arch() in the
seccomp audit code. Having syscall_get_arch() everywhere meant it was
a useless flag on the stack and we could get rid of it for the typical
syscall entry.
The other changes inside the audit system aren't grand, fixed some
records that had invalid spaces. Better locking around the task comm
field. Removing some dead functions and structs. Make some things
static. Really minor stuff"
* git://git.infradead.org/users/eparis/audit: (31 commits)
audit: rename audit_log_remove_rule to disambiguate for trees
audit: cull redundancy in audit_rule_change
audit: WARN if audit_rule_change called illegally
audit: put rule existence check in canonical order
next: openrisc: Fix build
audit: get comm using lock to avoid race in string printing
audit: remove open_arg() function that is never used
audit: correct AUDIT_GET_FEATURE return message type
audit: set nlmsg_len for multicast messages.
audit: use union for audit_field values since they are mutually exclusive
audit: invalid op= values for rules
audit: use atomic_t to simplify audit_serial()
kernel/audit.c: use ARRAY_SIZE instead of sizeof/sizeof[0]
audit: reduce scope of audit_log_fcaps
audit: reduce scope of audit_net_id
audit: arm64: Remove the audit arch argument to audit_syscall_entry
arm64: audit: Add audit hook in syscall_trace_enter/exit()
audit: x86: drop arch from __audit_syscall_entry() interface
sparc: implement is_32bit_task
sparc: properly conditionalize use of TIF_32BIT
...
Commit b0c29f79ec (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd0 (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.
However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).
Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
Fixes: b0c29f79ec (futexes: Avoid taking the hb->lock if there's nothing to wake up)
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull percpu consistent-ops changes from Tejun Heo:
"Way back, before the current percpu allocator was implemented, static
and dynamic percpu memory areas were allocated and handled separately
and had their own accessors. The distinction has been gone for many
years now; however, the now duplicate two sets of accessors remained
with the pointer based ones - this_cpu_*() - evolving various other
operations over time. During the process, we also accumulated other
inconsistent operations.
This pull request contains Christoph's patches to clean up the
duplicate accessor situation. __get_cpu_var() uses are replaced with
with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().
Unfortunately, the former sometimes is tricky thanks to C being a bit
messy with the distinction between lvalues and pointers, which led to
a rather ugly solution for cpumask_var_t involving the introduction of
this_cpu_cpumask_var_ptr().
This converts most of the uses but not all. Christoph will follow up
with the remaining conversions in this merge window and hopefully
remove the obsolete accessors"
* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
irqchip: Properly fetch the per cpu offset
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
Revert "powerpc: Replace __get_cpu_var uses"
percpu: Remove __this_cpu_ptr
clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
sparc: Replace __get_cpu_var uses
avr32: Replace __get_cpu_var with __this_cpu_write
blackfin: Replace __get_cpu_var uses
tile: Use this_cpu_ptr() for hardware counters
tile: Replace __get_cpu_var uses
powerpc: Replace __get_cpu_var uses
alpha: Replace __get_cpu_var
ia64: Replace __get_cpu_var uses
s390: cio driver &__get_cpu_var replacements
s390: Replace __get_cpu_var uses
mips: Replace __get_cpu_var uses
MIPS: Replace __get_cpu_var uses in FPU emulator.
arm: Replace __this_cpu_ptr with raw_cpu_ptr
...
A panic was seen in the following sitation.
There are two threads running on the system. The first thread is a system
monitoring thread that is reading /proc/modules. The second thread is
loading and unloading a module (in this example I'm using my simple
dummy-module.ko). Note, in the "real world" this occurred with the qlogic
driver module.
When doing this, the following panic occurred:
------------[ cut here ]------------
kernel BUG at kernel/module.c:3739!
invalid opcode: 0000 [#1] SMP
Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module]
CPU: 37 PID: 186343 Comm: cat Tainted: GF O-------------- 3.10.0+ #7
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000
RIP: 0010:[<ffffffff810d64c5>] [<ffffffff810d64c5>] module_flags+0xb5/0xc0
RSP: 0018:ffff88080fa7fe18 EFLAGS: 00010246
RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000
RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000
RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000
R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000
FS: 00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c
ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00
ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0
Call Trace:
[<ffffffff810d666c>] m_show+0x19c/0x1e0
[<ffffffff811e4d7e>] seq_read+0x16e/0x3b0
[<ffffffff812281ed>] proc_reg_read+0x3d/0x80
[<ffffffff811c0f2c>] vfs_read+0x9c/0x170
[<ffffffff811c1a58>] SyS_read+0x58/0xb0
[<ffffffff81605829>] system_call_fastpath+0x16/0x1b
Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
RIP [<ffffffff810d64c5>] module_flags+0xb5/0xc0
RSP <ffff88080fa7fe18>
Consider the two processes running on the system.
CPU 0 (/proc/modules reader)
CPU 1 (loading/unloading module)
CPU 0 opens /proc/modules, and starts displaying data for each module by
traversing the modules list via fs/seq_file.c:seq_open() and
fs/seq_file.c:seq_read(). For each module in the modules list, seq_read
does
op->start() <-- this is a pointer to m_start()
op->show() <- this is a pointer to m_show()
op->stop() <-- this is a pointer to m_stop()
The m_start(), m_show(), and m_stop() module functions are defined in
kernel/module.c. The m_start() and m_stop() functions acquire and release
the module_mutex respectively.
ie) When reading /proc/modules, the module_mutex is acquired and released
for each module.
m_show() is called with the module_mutex held. It accesses the module
struct data and attempts to write out module data. It is in this code
path that the above BUG_ON() warning is encountered, specifically m_show()
calls
static char *module_flags(struct module *mod, char *buf)
{
int bx = 0;
BUG_ON(mod->state == MODULE_STATE_UNFORMED);
...
The other thread, CPU 1, in unloading the module calls the syscall
delete_module() defined in kernel/module.c. The module_mutex is acquired
for a short time, and then released. free_module() is called without the
module_mutex. free_module() then sets mod->state = MODULE_STATE_UNFORMED,
also without the module_mutex. Some additional code is called and then the
module_mutex is reacquired to remove the module from the modules list:
/* Now we can delete it from the lists */
mutex_lock(&module_mutex);
stop_machine(__unlink_module, mod, NULL);
mutex_unlock(&module_mutex);
This is the sequence of events that leads to the panic.
CPU 1 is removing dummy_module via delete_module(). It acquires the
module_mutex, and then releases it. CPU 1 has NOT set dummy_module->state to
MODULE_STATE_UNFORMED yet.
CPU 0, which is reading the /proc/modules, acquires the module_mutex and
acquires a pointer to the dummy_module which is still in the modules list.
CPU 0 calls m_show for dummy_module. The check in m_show() for
MODULE_STATE_UNFORMED passed for dummy_module even though it is being
torn down.
Meanwhile CPU 1, which has been continuing to remove dummy_module without
holding the module_mutex, now calls free_module() and sets
dummy_module->state to MODULE_STATE_UNFORMED.
CPU 0 now calls module_flags() with dummy_module and ...
static char *module_flags(struct module *mod, char *buf)
{
int bx = 0;
BUG_ON(mod->state == MODULE_STATE_UNFORMED);
and BOOM.
Acquire and release the module_mutex lock around the setting of
MODULE_STATE_UNFORMED in the teardown path, which should resolve the
problem.
Testing: In the unpatched kernel I can panic the system within 1 minute by
doing
while (true) do insmod dummy_module.ko; rmmod dummy_module.ko; done
and
while (true) do cat /proc/modules; done
in separate terminals.
In the patched kernel I was able to run just over one hour without seeing
any issues. I also verified the output of panic via sysrq-c and the output
of /proc/modules looks correct for all three states for the dummy_module.
dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-)
dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE)
dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
Merge second patch-bomb from Andrew Morton:
- a few hotfixes
- drivers/dma updates
- MAINTAINERS updates
- Quite a lot of lib/ updates
- checkpatch updates
- binfmt updates
- autofs4
- drivers/rtc/
- various small tweaks to less used filesystems
- ipc/ updates
- kernel/watchdog.c changes
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
kernel/param: consolidate __{start,stop}___param[] in <linux/moduleparam.h>
ia64: remove duplicate declarations of __per_cpu_start[] and __per_cpu_end[]
frv: remove unused declarations of __start___ex_table and __stop___ex_table
kvm: ensure hard lockup detection is disabled by default
kernel/watchdog.c: control hard lockup detection default
staging: rtl8192u: use %*pEn to escape buffer
staging: rtl8192e: use %*pEn to escape buffer
staging: wlan-ng: use %*pEhp to print SN
lib80211: remove unused print_ssid()
wireless: hostap: proc: print properly escaped SSID
wireless: ipw2x00: print SSID via %*pE
wireless: libertas: print esaped string via %*pE
lib/vsprintf: add %*pE[achnops] format specifier
lib / string_helpers: introduce string_escape_mem()
lib / string_helpers: refactoring the test suite
lib / string_helpers: move documentation to c-file
include/linux: remove strict_strto* definitions
arch/x86/mm/numa.c: fix boot failure when all nodes are hotpluggable
fs: check bh blocknr earlier when searching lru
...
Pull s390 updates from Martin Schwidefsky:
"This patch set contains the main portion of the changes for 3.18 in
regard to the s390 architecture. It is a bit bigger than usual,
mainly because of a new driver and the vector extension patches.
The interesting bits are:
- Quite a bit of work on the tracing front. Uprobes is enabled and
the ftrace code is reworked to get some of the lost performance
back if CONFIG_FTRACE is enabled.
- To improve boot time with CONFIG_DEBIG_PAGEALLOC, support for the
IPTE range facility is added.
- The rwlock code is re-factored to improve writer fairness and to be
able to use the interlocked-access instructions.
- The kernel part for the support of the vector extension is added.
- The device driver to access the CD/DVD on the HMC is added, this
will hopefully come in handy to improve the installation process.
- Add support for control-unit initiated reconfiguration.
- The crypto device driver is enhanced to enable the additional AP
domains and to allow the new crypto hardware to be used.
- Bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
s390/ftrace: simplify enabling/disabling of ftrace_graph_caller
s390/ftrace: remove 31 bit ftrace support
s390/kdump: add support for vector extension
s390/disassembler: add vector instructions
s390: add support for vector extension
s390/zcrypt: Toleration of new crypto hardware
s390/idle: consolidate idle functions and definitions
s390/nohz: use a per-cpu flag for arch_needs_cpu
s390/vtime: do not reset idle data on CPU hotplug
s390/dasd: add support for control unit initiated reconfiguration
s390/dasd: fix infinite loop during format
s390/mm: make use of ipte range facility
s390/setup: correct 4-level kernel page table detection
s390/topology: call set_sched_topology early
s390/uprobes: architecture backend for uprobes
s390/uprobes: common library for kprobes and uprobes
s390/rwlock: use the interlocked-access facility 1 instructions
s390/rwlock: improve writer fairness
s390/rwlock: remove interrupt-enabling rwlock variant.
s390/mm: remove change bit override support
...
Pull x86 seccomp changes from Ingo Molnar:
"This tree includes x86 seccomp filter speedups and related preparatory
work, which touches core seccomp facilities as well.
The main idea is to split seccomp into two phases, to be able to enter
a simple fast path for syscalls with ptrace side effects.
There's no substantial user-visible (and ABI) effects expected from
this, except a change in how we emit a better audit record for
SECCOMP_RET_TRACE events"
* 'x86-seccomp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscalls
x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls
x86: Split syscall_trace_enter into two phases
x86, entry: Only call user_exit if TIF_NOHZ
x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit
seccomp: Document two-phase seccomp and arch-provided seccomp_data
seccomp: Allow arch code to provide seccomp_data
seccomp: Refactor the filter callback and the API
seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing
Consolidate the various external const and non-const declarations of
__start___param[] and __stop___param in <linux/moduleparam.h>. This
requires making a few struct kernel_param pointers in kernel/params.c
const.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some cases we don't want hard lockup detection enabled by default.
An example is when running as a guest. Introduce
watchdog_enable_hardlockup_detector(bool)
allowing those cases to disable hard lockup detection. This must be
executed early by the boot processor from e.g. smp_prepare_boot_cpu, in
order to allow kernel command line arguments to override it, as well as
to avoid hard lockup detection being enabled before we've had a chance
to indicate that it's unwanted. In summary,
initial boot: default=enabled
smp_prepare_boot_cpu
watchdog_enable_hardlockup_detector(false): default=disabled
cmdline has 'nmi_watchdog=1': default=enabled
The running kernel still has the ability to enable/disable at any time
with /proc/sys/kernel/nmi_watchdog us usual. However even when the
default has been overridden /proc/sys/kernel/nmi_watchdog will initially
show '1'. To truly turn it on one must disable/enable it, i.e.
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/nmi_watchdog
This patch will be immediately useful for KVM with the next patch of this
series. Other hypervisor guest types may find it useful as well.
[akpm@linux-foundation.org: fix build]
[dzickus@redhat.com: fix compile issues on sparc]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.
To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a large university system in the UK that is experiencing very long
delays modprobing the driver for a specific I/O device. The delay is from
8-10 minutes per device and there are 31 devices in the system. This 4 to
5 hour delay in starting up those I/O devices is very much a burden on the
customer.
There are two causes for requiring a restart/reload of the drivers. First
is periodic preventive maintenance (PM) and the second is if any of the
devices experience a fatal error. Both of these trigger this excessively
long delay in bringing the system back up to full capability.
The problem was tracked down to a very slow IOREMAP operation and the
excessively long ioresource lookup to insure that the user is not
attempting to ioremap RAM. These patches provide a speed up to that
function.
The modprobe time appears to be affected quite a bit by previous activity
on the ioresource list, which I suspect is due to cache preloading. While
the overall improvement is impacted by other overhead of starting the
devices, this drastically improves the modprobe time.
Also our system is considerably smaller so the percentages gained will not
be the same. Best case improvement with the modprobe on our 20 device
smallish system was from 'real 5m51.913s' to 'real 0m18.275s'.
This patch (of 2):
Since the ioremap operation is verifying that the specified address range
is NOT RAM, it will search the entire ioresource list if the condition is
true. To make matters worse, it does this one 4k page at a time. For a
128M BAR region this is 32 passes to determine the entire region does not
contain any RAM addresses.
This patch provides another resource lookup function, region_is_ram, that
searches for the entire region specified, verifying that it is completely
contained within the resource region. If it is found, then it is checked
to be RAM or not, within a single pass.
The return result reflects if it was found or not (-1), and whether it is
RAM (1) or not (0). This allows the caller to fallback to the previous
page by page search if it was not found.
[akpm@linux-foundation.org: fix spellos and typos in comment]
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Alex Thorlton <athorlton@sgi.com>
Reviewed-by: Cliff Wickman <cpw@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a cleanup. In function parse_crashkernel_suffix, the parameter
crash_base is not used. So here remove it.
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In locate_mem_hole functions, a memory hole is located and added as
kexec_segment. But from the name of locate_mem_hole, it should only take
responsibility of searching a available memory hole to contain data of a
specified size.
So in this patch add a new field 'mem' into kexec_buf, then take that
kexec segment adding code out of locate_mem_hole_top_down and
locate_mem_hole_bottom_up. This make clear of the functionality of
locate_mem_hole just like it declars to do. And by this
locate_mem_hole_callback chould be used later if anyone want to locate a
memory hole for other use.
Meanwhile Vivek suggested opening code function __kexec_add_segment(),
that way we have to retreive ksegment pointer once and it is easy to read.
So just do it in this patch and remove __kexec_add_segment() since no one
use it anymore.
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reduce boilerplate code by using __seq_open_private() instead of
seq_open() in kallsyms_open().
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Gideon Israel Dsouza <gidisrael@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 458df9fd48 ("printk: remove separate printk_sched buffers and use
printk buf instead") hardcodes printk_deferred() to KERN_WARNING and
inserts the string "[sched_delayed] " before the actual message. However
it doesn't take into account the KERN_* prefix of the message, that now
ends up in the middle of the output:
[sched_delayed] ^a4CE: hpet increased min_delta_ns to 20115 nsec
Fix this by just getting rid of the "[sched_delayed] " scnprintf(). The
prefix is useless since 458df9fd48 anyway since from that moment
printk_deferred() inserts the message into the kernel printk buffer
immediately. So if the message eventually gets printed to console, it is
printed in the correct order with other messages and there's no need for
any special prefix. And if the kernel crashes before the message makes it
to console, then prefix in the printk buffer doesn't make the situation
any better.
Link: http://lkml.org/lkml/2014/9/14/4
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When configuring a uniprocessor kernel, don't bother the user with an
irrelevant LOG_CPU_MAX_BUF_SHIFT question, and don't build the unused
code.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
Pull perf fixes from Ingo Molnar:
"Two leftover fixes from the v3.17 cycle - these will be forwarded to
stable as well, if they prove problem-free in wider testing as well"
[ Side note: the "fix perf bug in fork()" fix had also come in through
Andrew's patch-bomb - Linus ]
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix perf bug in fork()
perf: Fix unclone_ctx() vs. locking
Pull perf updates from Ingo Molnar:
"Kernel side updates:
- Fix and enhance poll support (Jiri Olsa)
- Re-enable inheritance optimization (Jiri Olsa)
- Enhance Intel memory events support (Stephane Eranian)
- Refactor the Intel uncore driver to be more maintainable (Zheng
Yan)
- Enhance and fix Intel CPU and uncore PMU drivers (Peter Zijlstra,
Andi Kleen)
- [ plus various smaller fixes/cleanups ]
User visible tooling updates:
- Add +field argument support for --field option, so that one can add
fields to the default list of fields to show, ie now one can just
do:
perf report --fields +pid
And the pid will appear in addition to the default fields (Jiri
Olsa)
- Add +field argument support for --sort option (Jiri Olsa)
- Honour -w in the report tools (report, top), allowing to specify
the widths for the histogram entries columns (Namhyung Kim)
- Properly show submicrosecond times in 'perf kvm stat' (Christian
Borntraeger)
- Add beautifier for mremap flags param in 'trace' (Alex Snast)
- perf script: Allow callchains if any event samples them
- Don't truncate Intel style addresses in 'annotate' (Alex Converse)
- Allow profiling when kptr_restrict == 1 for non root users, kernel
samples will just remain unresolved (Andi Kleen)
- Allow configuring default options for callchains in config file
(Namhyung Kim)
- Support operations for shared futexes. (Davidlohr Bueso)
- "perf kvm stat report" improvements by Alexander Yarygin:
- Save pid string in opts.target.pid
- Enable the target.system_wide flag
- Unify the title bar output
- [ plus lots of other fixes and small improvements. ]
Tooling infrastructure changes:
- Refactor unit and scale function parameters for PMU parsing
routines (Matt Fleming)
- Improve DSO long names lookup with rbtree, resulting in great
speedup for workloads with lots of DSOs (Waiman Long)
- We were not handling POLLHUP notifications for event file
descriptors
Fix it by filtering entries in the events file descriptor array
after poll() returns, refcounting mmaps so that when the last fd
pointing to a perf mmap goes away we do the unmap (Arnaldo Carvalho
de Melo)
- Intel PT prep work, from Adrian Hunter, including:
- Let a user specify a PMU event without any config terms
- Add perf-with-kcore script
- Let default config be defined for a PMU
- Add perf_pmu__scan_file()
- Add a 'perf test' for tracking with sched_switch
- Add 'flush' callback to scripting API
- Use ring buffer consume method to look like other tools (Arnaldo
Carvalho de Melo)
- hists browser (used in top and report) refactorings, getting rid of
unused variables and reducing source code size by handling similar
cases in a fewer functions (Namhyung Kim).
- Replace thread unsafe strerror() with strerror_r() accross the
whole tools/perf/ tree (Masami Hiramatsu)
- Rename ordered_samples to ordered_events and allow setting a queue
size for ordering events (Jiri Olsa)
- [ plus lots of fixes, cleanups and other improvements ]"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (198 commits)
perf/x86: Tone down kernel messages when the PMU check fails in a virtual environment
perf/x86/intel/uncore: Fix minor race in box set up
perf record: Fix error message for --filter option not coming after tracepoint
perf tools: Fix build breakage on arm64 targets
perf symbols: Improve DSO long names lookup speed with rbtree
perf symbols: Encapsulate dsos list head into struct dsos
perf bench futex: Sanitize -q option in requeue
perf bench futex: Support operations for shared futexes
perf trace: Fix mmap return address truncation to 32-bit
perf tools: Refactor unit and scale function parameters
perf tools: Fix line number in the config file error message
perf tools: Convert {record,top}.call-graph option to call-graph.record-mode
perf tools: Introduce perf_callchain_config()
perf callchain: Move some parser functions to callchain.c
perf tools: Move callchain config from record_opts to callchain_param
perf hists browser: Fix callchain print bug on TUI
perf tools: Use ACCESS_ONCE() instead of volatile cast
perf tools: Modify error code for when perf_session__new() fails
perf tools: Fix perf record as non root with kptr_restrict == 1
perf stat: Fix --per-core on multi socket systems
...
Pull core locking updates from Ingo Molnar:
"The main updates in this cycle were:
- mutex MCS refactoring finishing touches: improve comments, refactor
and clean up code, reduce debug data structure footprint, etc.
- qrwlock finishing touches: remove old code, self-test updates.
- small rwsem optimization
- various smaller fixes/cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Revert qrwlock recusive stuff
locking/rwsem: Avoid double checking before try acquiring write lock
locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition
locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S
locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code
locking/semaphore: Resolve some shadow warnings
locking/selftest: Support queued rwlock
locking/lockdep: Restrict the use of recursive read_lock() with qrwlock
locking/spinlocks: Always evaluate the second argument of spin_lock_nested()
locking/Documentation: Update locking/mutex-design.txt disadvantages
locking/Documentation: Move locking related docs into Documentation/locking/
locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate
locking/mutexes: Refactor optimistic spinning code
locking/mcs: Remove obsolete comment
locking/mutexes: Document quick lock release when unlocking
locking/mutexes: Standardize arguments in lock/unlock slowpaths
locking: Remove deprecated smp_mb__() barriers
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:
- changes related to No-CBs CPUs and NO_HZ_FULL
- RCU-tasks implementation
- torture-test updates
- miscellaneous fixes
- locktorture updates
- RCU documentation updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
workqueue: Use cond_resched_rcu_qs macro
workqueue: Add quiescent state between work items
locktorture: Cleanup header usage
locktorture: Cannot hold read and write lock
locktorture: Fix __acquire annotation for spinlock irq
locktorture: Support rwlocks
rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
locktorture: Document boot/module parameters
rcutorture: Rename rcutorture_runnable parameter
locktorture: Add test scenario for rwsem_lock
locktorture: Add test scenario for mutex_lock
locktorture: Make torture scripting account for new _runnable name
locktorture: Introduce torture context
locktorture: Support rwsems
locktorture: Add infrastructure for torturing read locks
torture: Address race in module cleanup
locktorture: Make statistics generic
locktorture: Teach about lock debugging
locktorture: Support mutexes
locktorture: Add documentation
...
Pull vfs updates from Al Viro:
"The big thing in this pile is Eric's unmount-on-rmdir series; we
finally have everything we need for that. The final piece of prereqs
is delayed mntput() - now filesystem shutdown always happens on
shallow stack.
Other than that, we have several new primitives for iov_iter (Matt
Wilcox, culled from his XIP-related series) pushing the conversion to
->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
cleanups and fixes (including the external name refcounting, which
gives consistent behaviour of d_move() wrt procfs symlinks for long
and short names alike) and assorted cleanups and fixes all over the
place.
This is just the first pile; there's a lot of stuff from various
people that ought to go in this window. Starting with
unionmount/overlayfs mess... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
fs/file_table.c: Update alloc_file() comment
vfs: Deduplicate code shared by xattr system calls operating on paths
reiserfs: remove pointless forward declaration of struct nameidata
don't need that forward declaration of struct nameidata in dcache.h anymore
take dname_external() into fs/dcache.c
let path_init() failures treated the same way as subsequent link_path_walk()
fix misuses of f_count() in ppp and netlink
ncpfs: use list_for_each_entry() for d_subdirs walk
vfs: move getname() from callers to do_mount()
gfs2_atomic_open(): skip lookups on hashed dentry
[infiniband] remove pointless assignments
gadgetfs: saner API for gadgetfs_create_file()
f_fs: saner API for ffs_sb_create_file()
jfs: don't hash direct inode
[s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
ecryptfs: ->f_op is never NULL
android: ->f_op is never NULL
nouveau: __iomem misannotations
missing annotation in fs/file.c
fs: namespace: suppress 'may be used uninitialized' warnings
...
code screem nasty warnings:
WARNING: CPU: 0 PID: 91 at kernel/sched/core.c:7253 __might_sleep+0x9a/0x378()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d79b511>] event_test_thread+0x48/0x93
Modules linked in:
CPU: 0 PID: 91 Comm: test-events Not tainted 3.17.0-rc7-00109-g2f85d18 #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
0000000000000000 ffff880010ec3c80 ffffffff8c696943 ffff880010ec3cb8
ffffffff8be7cae5 ffffffff8bead236 0000000000000001 ffff88001161fa01
0000000000000001 0000000000000000 ffff880010ec3d20 ffffffff8be7cb46
Call Trace:
[<ffffffff8c696943>] dump_stack+0x19/0x1b
[<ffffffff8be7cae5>] warn_slowpath_common+0x8f/0xa8
[<ffffffff8bead236>] ? __might_sleep+0x9a/0x378
[<ffffffff8be7cb46>] warn_slowpath_fmt+0x48/0x50
[<ffffffff8be0dd55>] ? sched_clock+0x9/0xd
[<ffffffff8d79b511>] ? event_test_thread+0x48/0x93
[<ffffffff8d79b511>] ? event_test_thread+0x48/0x93
[<ffffffff8bead236>] __might_sleep+0x9a/0x378
[<ffffffff8c6a0227>] down_read+0x26/0x98
[<ffffffff8be8f503>] exit_signals+0x27/0x1c2
[<ffffffff8be7fedd>] do_exit+0x193/0x10bd
[<ffffffff8bfd1969>] ? kfree+0x4a0/0x4d7
[<ffffffff8d79b4c9>] ? event_trace_self_tests+0x6d7/0x6d7
[<ffffffff8d79b4c9>] ? event_trace_self_tests+0x6d7/0x6d7
[<ffffffff8bea4b65>] kthread+0x156/0x156
[<ffffffff8c69c0f8>] ? wait_for_common+0x3e/0x224
[<ffffffff8bea4a0f>] ? insert_kthread_work+0xe7/0xe7
[<ffffffff8c6a353a>] ret_from_fork+0x7a/0xb0
[<ffffffff8bea4a0f>] ? insert_kthread_work+0xe7/0xe7
---[ end trace 14d02ef17adbc114 ]---
These are triggered by some self tests that run at start up when
configure in. Although the code is technically correct, they are a little
sloppy and not very robust. They work now because it runs at boot up
and the tests do not call anything that might trigger a spurious
wake up. But that doesn't mean those tests wont change in the future.
It's best to clean them now to make sure the tests used to test the
internal workings of the system don't cause breakage themselves.
This also quiets the warnings made by the new checks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUNrcAAAoJEKQekfcNnQGu+oYH/3NaLEKOwQU+x0aL/rfSFB86
qtIq3X4iHGekGjrlN38N2Z36izI9AoYuGrWYReMFy1VcvnRxPAl1mc0y0dZfdW/C
KRLwKTAu0t78Ab8vzyXVDxS+Bs/zEi6V/8HykBFbCthiDz5IbTvxCoeS19O/X9CU
ptVKllUlywjKQD5UMiJwk7eOB5GspOeBgNu9MOh61ZfbYBVsl1hPqmD0gEaSH2Me
wLyDlIyc0P9dfeYeaqYblkiBaXLk2urZDU2Enffi1aueEwwWuN5x+DPGc6d6nGQW
fnworqoiYzz+maQoASwaLdCfJAP3cX5Ye7qWQk7QEtp4Ypdh5j7EacAf9pKEJg8=
=goKt
-----END PGP SIGNATURE-----
Merge tag 'trace-3.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Seems that Peter Zijlstra added a new check that is making old code
scream nasty warnings:
WARNING: CPU: 0 PID: 91 at kernel/sched/core.c:7253 __might_sleep+0x9a/0x378()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d79b511>] event_test_thread+0x48/0x93
Call Trace:
__might_sleep+0x9a/0x378
down_read+0x26/0x98
exit_signals+0x27/0x1c2
do_exit+0x193/0x10bd
kthread+0x156/0x156
ret_from_fork+0x7a/0xb0
These are triggered by some self tests that run at start up when
configure in. Although the code is technically correct, they are a
little sloppy and not very robust. They work now because it runs at
boot up and the tests do not call anything that might trigger a
spurious wake up. But that doesn't mean those tests wont change in
the future.
It's best to clean them now to make sure the tests used to test the
internal workings of the system don't cause breakage themselves.
This also quiets the warnings made by the new checks"
* tag 'trace-3.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Clean up scheduling in trace_wakeup_test_thread()
tracing: Robustify wait loop
of the trampoline logic.
The trampoline logic of 3.17 required a descriptor for every function
that is registered to be traced and uses a trampoline. Currently,
only the function graph tracer uses a trampoline, but if you were
to trace all 32,000 (give or take a few thousand) functions with the
function graph tracer, it would create 32,000 descriptors to let us
know that there's a trampoline associated with it. This takes up a bit
of memory when there's a better way to do it.
The redesign now reuses the ftrace_ops' (what registers the function graph
tracer) hash tables. The hash tables tell ftrace what the tracer
wants to trace or doesn't want to trace. There's two of them: one
that tells us what to trace, the other tells us what not to trace.
If the first one is empty, it means all functions should be traced,
otherwise only the ones that are listed should be. The second hash table
tells us what not to trace, and if it is empty, all functions may be
traced, and if there's any listed, then those should not be traced
even if they exist in the first hash table.
It took a bit of massaging, but now these hashes can be used to
keep track of what has a trampoline and what does not, and allows
the ftrace accounting to work. Now we can trace all functions when using
the function graph trampoline, and avoid needing to create any special
descriptors to hold all the functions that are being traced.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUMwp6AAoJEKQekfcNnQGuIoAIAIsqvTYAnULyzKKCweEZYUfb
XJzz6cN5FPGSXkoeda1ZvnfOlHjFRrWNXzXMB0jZYR2hU++pe3xjtghaNzvbRcyV
wlwDUTsnz235OcOuFEspIwBamhtah96Coiwf/2z/2q6srXlHd/1TrqXB+Fpj1tEK
BkAViGDUEdq/eLZX7nGen36cTb5gpNqV9NjY1CVAK6bSkU/xXk/ArqFy1qy0MPnc
z/9bXdIf+Z6VnG/IzwRc2rwiMFuD1+lpjLuHEqagoHp1D4teCjWPSJl1EKCVAS40
GaCOTUZi92zIVgx8Bb28TglSla9MN65CO3E8dw6hlXUIsNz1p0eavpctnC6ac/Y=
=vDpP
-----END PGP SIGNATURE-----
Merge tag 'trace-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This set has a few minor updates, but the big change is the redesign
of the trampoline logic.
The trampoline logic of 3.17 required a descriptor for every function
that is registered to be traced and uses a trampoline. Currently,
only the function graph tracer uses a trampoline, but if you were to
trace all 32,000 (give or take a few thousand) functions with the
function graph tracer, it would create 32,000 descriptors to let us
know that there's a trampoline associated with it. This takes up a
bit of memory when there's a better way to do it.
The redesign now reuses the ftrace_ops' (what registers the function
graph tracer) hash tables. The hash tables tell ftrace what the
tracer wants to trace or doesn't want to trace. There's two of them:
one that tells us what to trace, the other tells us what not to trace.
If the first one is empty, it means all functions should be traced,
otherwise only the ones that are listed should be. The second hash
table tells us what not to trace, and if it is empty, all functions
may be traced, and if there's any listed, then those should not be
traced even if they exist in the first hash table.
It took a bit of massaging, but now these hashes can be used to keep
track of what has a trampoline and what does not, and allows the
ftrace accounting to work. Now we can trace all functions when using
the function graph trampoline, and avoid needing to create any special
descriptors to hold all the functions that are being traced"
* tag 'trace-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Only disable ftrace_enabled to test buffer in selftest
ftrace: Add sanity check when unregistering last ftrace_ops
kernel: trace_syscalls: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
tracing: generate RCU warnings even when tracepoints are disabled
ftrace: Replace tramp_hash with old_*_hash to save space
ftrace: Annotate the ops operation on update
ftrace: Grab any ops for a rec for enabled_functions output
ftrace: Remove freeing of old_hash from ftrace_hash_move()
ftrace: Set callback to ftrace_stub when no ops are registered
ftrace: Add helper function ftrace_ops_get_func()
ftrace: Add separate function for non recursive callbacks
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUJQ8/AAoJEMsfJm/On5mBMNgP+QEUHpRKJaOGU3jX/ftHH/t3
EoNUx7lZt6Q0c9MB2ySAxILYpWUujc9N0tDkRDyW7mTWunF8gEGiRN+iKaSbzcUN
Y4VffRAbxBasIaBqRtpDl08ycODh6Xu1t8sAao03DdhnMNLGNNO79s3UFHsubdTC
cXx9mfYR/2SHV/0BXiFvKi8ovdqUspdp9cyZO/qc0PVFGbsADx3MNGGzkvWfgvcE
6vXnKnUkZrNl5JPiG77kTKZnDsjEMXggmA9DGWKijFCJjGIbuLiuIDf63Zp+eQ52
mJMRA+ViP/dDgAxY1dkWBcF5nOBT1vTYwLfy69jEoQeHzcomiHVoDKmCSBOpeAEH
G8VoasWKWYpYnlcOJb+XgkA3QTe6mOPgAPzNsbYr0Ep7hMFw66mOQgKbgi6k4Qts
HHimG9pnBYpPlBUfvNh+6K4dHAm0C2IyoZyMhKWsyFH6hkhS8TVM8j0gPR8rTTmk
0a9/e2vxcFnfBe3UAJaqzWRVFsBkOHrTNpG1hvID3Oq8IeywSBXw2VMSR93+mwaB
sa/GCZKlqHGpOfmtILlhiXQX0E/tTHmcrI2VqyCpX0J2CW+MiGvkcGOwKHOJciSA
Cj9D68y837QU/DCpMQ6ec/5wqWqZKz8yQb8kxb6vJcL19JcVKdAiPzbuOI49C3Ux
YxDWoUutzDfVoUD5RhcJ
=cP1w
-----END PGP SIGNATURE-----
Merge tag 'restart-handler-for-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull restart handler infrastructure from Guenter Roeck:
"This series was supposed to be pulled through various trees using it,
and I did not plan to send a separate pull request. As it turns out,
the pinctrl tree did not merge with it, is now upstream, and uses it,
meaning there are now build failures.
Please pull this series directly to fix those build failures"
* tag 'restart-handler-for-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
arm/arm64: unexport restart handlers
watchdog: sunxi: register restart handler with kernel restart handler
watchdog: alim7101: register restart handler with kernel restart handler
watchdog: moxart: register restart handler with kernel restart handler
arm: support restart through restart handler call chain
arm64: support restart through restart handler call chain
power/restart: call machine_restart instead of arm_pm_restart
kernel: add support for kernel restart handler call chain
Rename audit_log_remove_rule() to audit_tree_log_remove_rule() to avoid
confusion with watch and mark rule removal/changes.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Re-factor audit_rule_change() to reduce the amount of code redundancy and
simplify the logic.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Use same rule existence check order as audit_make_tree(), audit_to_watch(),
update_lsm_rule() for legibility.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Pull percpu updates from Tejun Heo:
"A lot of activities on percpu front. Notable changes are...
- percpu allocator now can take @gfp. If @gfp doesn't contain
GFP_KERNEL, it tries to allocate from what's already available to
the allocator and a work item tries to keep the reserve around
certain level so that these atomic allocations usually succeed.
This will replace the ad-hoc percpu memory pool used by
blk-throttle and also be used by the planned blkcg support for
writeback IOs.
Please note that I noticed a bug in how @gfp is interpreted while
preparing this pull request and applied the fix 6ae833c7fe
("percpu: fix how @gfp is interpreted by the percpu allocator")
just now.
- percpu_ref now uses longs for percpu and global counters instead of
ints. It leads to more sparse packing of the percpu counters on
64bit machines but the overhead should be negligible and this
allows using percpu_ref for refcnting pages and in-memory objects
directly.
- The switching between percpu and single counter modes of a
percpu_ref is made independent of putting the base ref and a
percpu_ref can now optionally be initialized in single or killed
mode. This allows avoiding percpu shutdown latency for cases where
the refcounted objects may be synchronously created and destroyed
in rapid succession with only a fraction of them reaching fully
operational status (SCSI probing does this when combined with
blk-mq support). It's also planned to be used to implement forced
single mode to detect underflow more timely for debugging.
There's a separate branch percpu/for-3.18-consistent-ops which cleans
up the duplicate percpu accessors. That branch causes a number of
conflicts with s390 and other trees. I'll send a separate pull
request w/ resolutions once other branches are merged"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
percpu: fix how @gfp is interpreted by the percpu allocator
blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
percpu_ref: add PERCPU_REF_INIT_* flags
percpu_ref: decouple switching to percpu mode and reinit
percpu_ref: decouple switching to atomic mode and killing
percpu_ref: add PCPU_REF_DEAD
percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref: minor code and comment updates
percpu_ref: relocate percpu_ref_reinit()
Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
Revert "percpu: free percpu allocation info for uniprocessor system"
percpu-refcount: make percpu_ref based on longs instead of ints
percpu-refcount: improve WARN messages
percpu: fix locking regression in the failure path of pcpu_alloc()
percpu-refcount: add @gfp to percpu_ref_init()
proportions: add @gfp to init functions
percpu_counter: add @gfp to percpu_counter_init()
percpu_counter: make percpu_counters_lock irq-safe
...
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. Just a handful of cleanup patches"
* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: remove redundant variable in cgroup_mount()"
cgroup: remove redundant variable in cgroup_mount()
cgroup: fix missing unlock in cgroup_release_agent()
cgroup: remove CGRP_RELEASABLE flag
perf/cgroup: Remove perf_put_cgroup()
cgroup: remove redundant check in cgroup_ino()
cpuset: simplify proc_cpuset_show()
cgroup: simplify proc_cgroup_show()
cgroup: use a per-cgroup work for release agent
cgroup: remove bogus comments
cgroup: remove redundant code in cgroup_rmdir()
cgroup: remove some useless forward declarations
cgroup: fix a typo in comment.
Merge patch-bomb from Andrew Morton:
- part of OCFS2 (review is laggy again)
- procfs
- slab
- all of MM
- zram, zbud
- various other random things: arch, filesystems.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
nosave: consolidate __nosave_{begin,end} in <asm/sections.h>
include/linux/screen_info.h: remove unused ORIG_* macros
kernel/sys.c: compat sysinfo syscall: fix undefined behavior
kernel/sys.c: whitespace fixes
acct: eliminate compile warning
kernel/async.c: switch to pr_foo()
include/linux/blkdev.h: use NULL instead of zero
include/linux/kernel.h: deduplicate code implementing clamp* macros
include/linux/kernel.h: rewrite min3, max3 and clamp using min and max
alpha: use Kbuild logic to include <asm-generic/sections.h>
frv: remove deprecated IRQF_DISABLED
frv: remove unused cpuinfo_frv and friends to fix future build error
zbud: avoid accessing last unused freelist
zsmalloc: simplify init_zspage free obj linking
mm/zsmalloc.c: correct comment for fullness group computation
zram: use notify_free to account all free notifications
zram: report maximum used memory
zram: zram memory size limitation
zsmalloc: change return value unit of zs_get_total_size_bytes
zsmalloc: move pages_allocated to zs_pool
...
Fix undefined behavior and compiler warning by replacing right shift 32
with upper_32_bits macro
Signed-off-by: Scotty Bauer <sbauer@eng.utah.edu>
Cc: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix minor errors and warning messages in kernel/sys.c. These errors were
reported by checkpatch while working with some modifications in sys.c
file. Fixing this first will help me to improve my further patches.
ERROR: trailing whitespace - 9
ERROR: do not use assignment in if condition - 4
ERROR: spaces required around that '?' (ctx:VxO) - 10
ERROR: switch and case should be at the same indent - 3
total 26 errors & 3 warnings fixed.
Signed-off-by: vishnu.ps <vishnu.ps@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If ACCT_VERSION is not defined to 3, below warning appears:
CC kernel/acct.o
kernel/acct.c: In function `do_acct_process':
kernel/acct.c:475:24: warning: unused variable `ns' [-Wunused-variable]
[akpm@linux-foundation.org: retain the local for code size improvements
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dump the contents of the relevant struct_mm when we hit the bug condition.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. vma_policy_mof(task) is simply not safe unless task == current,
it can race with do_exit()->mpol_put(). Remove this arg and update
its single caller.
2. vma can not be NULL, remove this check and simplify the code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The deprecation warnings for the scan_unevictable interface triggers by
scripts doing `sysctl -a | grep something else'. This is annoying and not
helpful.
The interface has been defunct since 264e56d824 ("mm: disable user
interface to manually rescue unevictable pages"), which was in 2011, and
there haven't been any reports of usecases for it, only reports that the
deprecation warnings are annying. It's unlikely that anybody is using
this interface specifically at this point, so remove it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During development of c/r we've noticed that in case if we need to support
user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
...) call, in particular once new user namespace is created
capable(CAP_SYS_RESOURCE) no longer passes.
A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
in one bundle, which would allow the kernel to make more intensive test
for sanity of values and same time allow us to support checkpoint/restore
of user namespaces.
Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
prctl_mm_map structure which carries all the members to be updated.
prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)
struct prctl_mm_map {
__u64 start_code;
__u64 end_code;
__u64 start_data;
__u64 end_data;
__u64 start_brk;
__u64 brk;
__u64 start_stack;
__u64 arg_start;
__u64 arg_end;
__u64 env_start;
__u64 env_end;
__u64 *auxv;
__u32 auxv_size;
__u32 exe_fd;
};
All members except @exe_fd correspond ones of struct mm_struct. To figure
out which available values these members may take here are meanings of the
members.
- start_code, end_code: represent bounds of executable code area
- start_data, end_data: represent bounds of data area
- start_brk, brk: used to calculate bounds for brk() syscall
- start_stack: used when accounting space needed for command
line arguments, environment and shmat() syscall
- arg_start, arg_end, env_start, env_end: represent memory area
supplied for command line arguments and environment variables
- auxv, auxv_size: carries auxiliary vector, Elf format specifics
- exe_fd: file descriptor number for executable link (/proc/self/exe)
Thus we apply the following requirements to the values
1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
interval.
2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
VMAs (say a program maps own new .text and .data segments during execution)
the rest of members should belong to VMA which must exist.
3) Addresses must be ordered, ie @start_ member must not be greater or
equal to appropriate @end_ member.
4) As in regular Elf loading procedure we require that @start_brk and
@brk be greater than @end_data.
5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
exceed existing limit. Same applies to RLIMIT_STACK.
6) Auxiliary vector size must not exceed existing one (which is
predefined as AT_VECTOR_SIZE and depends on architecture).
7) File descriptor passed in @exe_file should be pointing
to executable file (because we use existing prctl_set_mm_exe_file_locked
helper it ensures that the file we are going to use as exe link has all
required permission granted).
Now about where these members are involved inside kernel code:
- @start_code and @end_code are used in /proc/$pid/[stat|statm] output;
- @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
also they are considered if there enough space for brk() syscall
result if RLIMIT_DATA is set;
- @start_brk shown in /proc/$pid/stat output and accounted in brk()
syscall if RLIMIT_DATA is set; also this member is tested to
find a symbolic name of mmap event for perf system (we choose
if event is generated for "heap" area); one more aplication is
selinux -- we test if a process has PROCESS__EXECHEAP permission
if trying to make heap area being executable with mprotect() syscall;
- @brk is a current value for brk() syscall which lays inside heap
area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
provides new memory area to a user space upon brk() completion the
mm::brk is updated to carry new value;
Both @start_brk and @brk are actively used in /proc/$pid/maps
and /proc/$pid/smaps output to find a symbolic name "heap" for
VMA being scanned;
- @start_stack is printed out in /proc/$pid/stat and used to
find a symbolic name "stack" for task and threads in
/proc/$pid/maps and /proc/$pid/smaps output, and as the same
as with @start_brk -- perf system uses it for event naming.
Also kernel treat this member as a start address of where
to map vDSO pages and to check if there is enough space
for shmat() syscall;
- @arg_start, @arg_end, @env_start and @env_end are printed out
in /proc/$pid/stat. Another access to the data these members
represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
Any attempt to read these areas kernel tests with access_process_vm
helper so a user must have enough rights for this action;
- @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
speaking kernel doesn't care much about which exactly data is
sitting there because it is solely for userspace;
- @exe_fd is referred from /proc/$pid/exe and when generating
coredump. We uses prctl_set_mm_exe_file_locked helper to update
this member, so exe-file link modification remains one-shot
action.
Still note that updating exe-file link now doesn't require sys-resource
capability anymore, after all there is no much profit in preventing setup
own file link (there are a number of ways to execute own code -- ptrace,
ld-preload, so that the only reliable way to find which exactly code is
executed is to inspect running program memory). Still we require the
caller to be at least user-namespace root user.
I believe the old interface should be deprecated and ripped off in a
couple of kernel releases if no one against.
To test if new interface is implemented in the kernel one can pass
PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
supported struct prctl_mm_map.
[akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Vagin <avagin@openvz.org>
Tested-by: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out
and rename the helper to prctl_set_mm_exe_file_locked(). This will allow
to reuse this function in a next patch.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After discussions with Tejun, we don't want to spread the use of
cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
callers, but would rather ensure the callees correctly handle memoryless
nodes. With the previous patches ("topology: add support for
node_to_mem_node() to determine the fallback node" and "slub: fallback to
node_to_mem_node() node if allocating on memoryless node") adding and
using node_to_mem_node(), we can safely undo part of the change to the
kthread logic from 81c98869fa.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anton Blanchard <anton@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For now, soft lockup detector warns once for each case of process
softlockup. But the thread 'watchdog/n' may not always get the cpu at the
time slot between the task switch of two processes hogging that cpu to
reset soft_watchdog_warn.
An example would be two processes hogging the cpu. Process A causes the
softlockup warning and is killed manually by a user. Process B
immediately becomes the new process hogging the cpu preventing the
softlockup code from resetting the soft_watchdog_warn variable.
This case is a false negative of "warn only once for a process", as there
may be a different process that is going to hog the cpu. Resolve this by
saving/checking the task pointer of the hogging process and use that to
reset soft_watchdog_warn too.
[dzickus@redhat.com: update comment]
Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Rework the handling of wakeup IRQs by the IRQ core such that
all of them will be switched over to "wakeup" mode in
suspend_device_irqs() and in that mode the first interrupt
will abort system suspend in progress or wake up the system
if already in suspend-to-idle (or equivalent) without executing
any interrupt handlers. Among other things that eliminates the
wakeup-related motivation to use the IRQF_NO_SUSPEND interrupt
flag with interrupts which don't really need it and should not
use it (Thomas Gleixner and Rafael J Wysocki).
- Switch over ACPI to handling wakeup interrupts with the help
of the new mechanism introduced by the above IRQ core rework
(Rafael J Wysocki).
- Rework the core generic PM domains code to eliminate code that's
not used, add DT support and add a generic mechanism by which
devices can be added to PM domains automatically during
enumeration (Ulf Hansson, Geert Uytterhoeven and Tomasz Figa).
- Add debugfs-based mechanics for debugging generic PM domains
(Maciej Matraszek).
- ACPICA update to upstream version 20140828. Included are updates
related to the SRAT and GTDT tables and the _PSx methods are in
the METHOD_NAME list now (Bob Moore and Hanjun Guo).
- Add _OSI("Darwin") support to the ACPI core (unfortunately, that
can't really be done in a straightforward way) to prevent
Thunderbolt from being turned off on Apple systems after boot
(or after resume from system suspend) and rework the ACPI Smart
Battery Subsystem (SBS) driver to work correctly with Apple
platforms (Matthew Garrett and Andreas Noever).
- ACPI LPSS (Low-Power Subsystem) driver update cleaning up the
code, adding support for 133MHz I2C source clock on Intel Baytrail
to it and making it avoid using UART RTS override with Auto Flow
Control (Heikki Krogerus).
- ACPI backlight updates removing the video_set_use_native_backlight
quirk which is not necessary any more, making the code check the
list of output devices returned by the _DOD method to avoid
creating acpi_video interfaces that won't work and adding a quirk
for Lenovo Ideapad Z570 (Hans de Goede, Aaron Lu and Stepan Bujnak).
- New Win8 ACPI OSI quirks for some Dell laptops (Edward Lin).
- Assorted ACPI code cleanups (Fabian Frederick, Rasmus Villemoes,
Sudip Mukherjee, Yijing Wang, and Zhang Rui).
- cpufreq core updates and cleanups (Viresh Kumar, Preeti U Murthy,
Rasmus Villemoes).
- cpufreq driver updates: cpufreq-cpu0/cpufreq-dt (driver name
change among other things), ppc-corenet, powernv (Viresh Kumar,
Preeti U Murthy, Shilpasri G Bhat, Lucas Stach).
- cpuidle support for DT-based idle states infrastructure, new
ARM64 cpuidle driver, cpuidle core cleanups (Lorenzo Pieralisi,
Rasmus Villemoes).
- ARM big.LITTLE cpuidle driver updates: support for DT-based
initialization and Exynos5800 compatible string (Lorenzo Pieralisi,
Kevin Hilman).
- Rework of the test_suspend kernel command line argument and
a new trace event for console resume (Srinivas Pandruvada,
Todd E Brandt).
- Second attempt to optimize swsusp_free() (hibernation core) to
make it avoid going through all PFNs which may be way too slow on
some systems (Joerg Roedel).
- devfreq updates (Paul Bolle, Punit Agrawal, Ãrjan Eide).
- rockchip-io Adaptive Voltage Scaling (AVS) driver and AVS
entry update in MAINTAINERS (Heiko Stübner, Kevin Hilman).
- PM core fix related to clock management (Geert Uytterhoeven).
- PM core's sysfs code cleanup (Johannes Berg).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUNbJoAAoJEILEb/54YlRxRp8QAJyGIPdx+f03oBir+7vvEwhY
svxd+V9xXK0UgWNGkCvlMk/1RIVy0qqtXliUrDaE+9tcHACA9+iAxMmNmDsjLOiO
gpazuz5kgeznrmp1eNwQnYTt+OCReQIcyCsj4q4fNo9bbETTyr2bRz226LEuZekC
TAiKdphYoOszFBgTVg5gfu+lqjHyXjgXPnwMTlRYn1y4YL2adDIgxj9cFedykTTW
Eu593TY2dH6ovERJ6q3qxZbRuWuxtww95J07b3t2/2Eb3e/R/zlX0/XJ/C88f/m2
DkqngbOYqCdw+zJeN6k8631foyfUwAcTd0sJ1+5nsm5H4NE5NqObjbxOk5/yNht6
HgvgISGHWLerEw+A/Dk6o0oZOtR1G/TAQ5qQk5nUfKT/sSoU+9/USsXtWhXwZCia
XccnJgW6ZtPrJJP3zDnkrxe3gndmLic11QXArw2IhWTsq0sZlAyMgtauBXLdDiQa
H/AMiYrUNmIABef1cirBLTtgXN4Zbsai9vIrxMmV7OgBrclrh52NTjzr05P5Hnl2
fRK56mb6mP59LymI7n8fyXL8tHnbNwFvTaxuvrZmzcYbzL0l9DuPocJrrTHRSfhm
GFfzfvLj0R66ZM4PthRSwz4H2v1FnlRcCkj5k/QjtBPlyzxtOnJveqve5umbrnb9
T5mRmlAs4iYwLuKCVVNT
=sIv/
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"Features-wise, to me the most important this time is a rework of
wakeup interrupts handling in the core that makes them work
consistently across all of the available sleep states, including
suspend-to-idle. Many thanks to Thomas Gleixner for his help with
this work.
Second is an update of the generic PM domains code that has been in
need of some care for quite a while. Unused code is being removed, DT
support is being added and domains are now going to be attached to
devices in bus type code in analogy with the ACPI PM domain. The
majority of work here was done by Ulf Hansson who also has been the
most active developer this time.
Apart from this we have a traditional ACPICA update, this time to
upstream version 20140828 and a few ACPI wakeup interrupts handling
patches on top of the general rework mentioned above. There also are
several cpufreq commits including renaming the cpufreq-cpu0 driver to
cpufreq-dt, as this is what implements generic DT-based cpufreq
support, and a new DT-based idle states infrastructure for cpuidle.
In addition to that, the ACPI LPSS driver is updated, ACPI support for
Apple machines is improved, a few bugs are fixed and a few cleanups
are made all over.
Finally, the Adaptive Voltage Scaling (AVS) subsystem now has a tree
maintained by Kevin Hilman that will be merged through the PM tree.
Numbers-wise, the generic PM domains update takes the lead this time
with 32 non-merge commits, second is cpufreq (15 commits) and the 3rd
place goes to the wakeup interrupts handling rework (13 commits).
Specifics:
- Rework the handling of wakeup IRQs by the IRQ core such that all of
them will be switched over to "wakeup" mode in suspend_device_irqs()
and in that mode the first interrupt will abort system suspend in
progress or wake up the system if already in suspend-to-idle (or
equivalent) without executing any interrupt handlers. Among other
things that eliminates the wakeup-related motivation to use the
IRQF_NO_SUSPEND interrupt flag with interrupts which don't really
need it and should not use it (Thomas Gleixner and Rafael Wysocki)
- Switch over ACPI to handling wakeup interrupts with the help of the
new mechanism introduced by the above IRQ core rework (Rafael Wysocki)
- Rework the core generic PM domains code to eliminate code that's
not used, add DT support and add a generic mechanism by which
devices can be added to PM domains automatically during enumeration
(Ulf Hansson, Geert Uytterhoeven and Tomasz Figa).
- Add debugfs-based mechanics for debugging generic PM domains
(Maciej Matraszek).
- ACPICA update to upstream version 20140828. Included are updates
related to the SRAT and GTDT tables and the _PSx methods are in the
METHOD_NAME list now (Bob Moore and Hanjun Guo).
- Add _OSI("Darwin") support to the ACPI core (unfortunately, that
can't really be done in a straightforward way) to prevent
Thunderbolt from being turned off on Apple systems after boot (or
after resume from system suspend) and rework the ACPI Smart Battery
Subsystem (SBS) driver to work correctly with Apple platforms
(Matthew Garrett and Andreas Noever).
- ACPI LPSS (Low-Power Subsystem) driver update cleaning up the code,
adding support for 133MHz I2C source clock on Intel Baytrail to it
and making it avoid using UART RTS override with Auto Flow Control
(Heikki Krogerus).
- ACPI backlight updates removing the video_set_use_native_backlight
quirk which is not necessary any more, making the code check the
list of output devices returned by the _DOD method to avoid
creating acpi_video interfaces that won't work and adding a quirk
for Lenovo Ideapad Z570 (Hans de Goede, Aaron Lu and Stepan Bujnak)
- New Win8 ACPI OSI quirks for some Dell laptops (Edward Lin)
- Assorted ACPI code cleanups (Fabian Frederick, Rasmus Villemoes,
Sudip Mukherjee, Yijing Wang, and Zhang Rui)
- cpufreq core updates and cleanups (Viresh Kumar, Preeti U Murthy,
Rasmus Villemoes)
- cpufreq driver updates: cpufreq-cpu0/cpufreq-dt (driver name change
among other things), ppc-corenet, powernv (Viresh Kumar, Preeti U
Murthy, Shilpasri G Bhat, Lucas Stach)
- cpuidle support for DT-based idle states infrastructure, new ARM64
cpuidle driver, cpuidle core cleanups (Lorenzo Pieralisi, Rasmus
Villemoes)
- ARM big.LITTLE cpuidle driver updates: support for DT-based
initialization and Exynos5800 compatible string (Lorenzo Pieralisi,
Kevin Hilman)
- Rework of the test_suspend kernel command line argument and a new
trace event for console resume (Srinivas Pandruvada, Todd E Brandt)
- Second attempt to optimize swsusp_free() (hibernation core) to make
it avoid going through all PFNs which may be way too slow on some
systems (Joerg Roedel)
- devfreq updates (Paul Bolle, Punit Agrawal, Ãrjan Eide).
- rockchip-io Adaptive Voltage Scaling (AVS) driver and AVS entry
update in MAINTAINERS (Heiko Stübner, Kevin Hilman)
- PM core fix related to clock management (Geert Uytterhoeven)
- PM core's sysfs code cleanup (Johannes Berg)"
* tag 'pm+acpi-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (105 commits)
ACPI / fan: printk replacement
PM / clk: Fix crash in clocks management code if !CONFIG_PM_RUNTIME
PM / Domains: Rename cpu_data to cpuidle_data
cpufreq: cpufreq-dt: fix potential double put of cpu OF node
cpufreq: cpu0: rename driver and internals to 'cpufreq_dt'
PM / hibernate: Iterate over set bits instead of PFNs in swsusp_free()
cpufreq: ppc-corenet: remove duplicate update of cpu_data
ACPI / sleep: Rework the handling of ACPI GPE wakeup from suspend-to-idle
PM / sleep: Rename platform suspend/resume functions in suspend.c
PM / sleep: Export dpm_suspend_late/noirq() and dpm_resume_early/noirq()
ACPICA: Introduce acpi_enable_all_wakeup_gpes()
ACPICA: Clear all non-wakeup GPEs in acpi_hw_enable_wakeup_gpe_block()
ACPI / video: check _DOD list when creating backlight devices
PM / Domains: Move dev_pm_domain_attach|detach() to pm_domain.h
cpufreq: Replace strnicmp with strncasecmp
cpufreq: powernv: Set the cpus to nominal frequency during reboot/kexec
cpufreq: powernv: Set the pstate of the last hotplugged out cpu in policy->cpus to minimum
cpufreq: Allow stop CPU callback to be used by all cpufreq drivers
PM / devfreq: exynos: Enable building exynos PPMU as module
PM / devfreq: Export helper functions for drivers
...
Peter's new debugging tool triggers when tasks exit with !TASK_RUNNING.
The code in trace_wakeup_test_thread() also has a single schedule() call
that should be encompassed by a loop.
This cleans up the code a little to make it a bit more robust and
also makes the return exit properly with TASK_RUNNING.
Link: http://lkml.kernel.org/p/20141008135216.76142204@gandalf.local.home
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infreadead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull irq updates from Thomas Gleixner:
"The irq departement delivers:
- a cleanup series to get rid of mindlessly copied code.
- another bunch of new pointlessly different interrupt chip drivers.
Adding homebrewn irq chips (and timers) to SoCs must provide a
value add which is beyond the imagination of mere mortals.
- the usual SoC irq controller updates, IOW my second cat herding
project"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
irqchip: gic-v3: Implement CPU PM notifier
irqchip: gic-v3: Refactor gic_enable_redist to support both enabling and disabling
irqchip: renesas-intc-irqpin: Add minimal runtime PM support
irqchip: renesas-intc-irqpin: Add helper variable dev = &pdev->dev
irqchip: atmel-aic5: Add sama5d4 support
irqchip: atmel-aic5: The sama5d3 has 48 IRQs
Documentation: bcm7120-l2: Add Broadcom BCM7120-style L2 binding
irqchip: bcm7120-l2: Add Broadcom BCM7120-style Level 2 interrupt controller
irqchip: renesas-irqc: Add binding docs for new R-Car Gen2 SoCs
irqchip: renesas-irqc: Add DT binding documentation
irqchip: renesas-intc-irqpin: Document SoC-specific bindings
openrisc: Get rid of handle_IRQ
arm64: Get rid of handle_IRQ
ARM: omap2: irq: Convert to handle_domain_irq
ARM: imx: tzic: Convert to handle_domain_irq
ARM: imx: avic: Convert to handle_domain_irq
irqchip: or1k-pic: Convert to handle_domain_irq
irqchip: atmel-aic5: Convert to handle_domain_irq
irqchip: atmel-aic: Convert to handle_domain_irq
irqchip: gic-v3: Convert to handle_domain_irq
...
Pull timer updates from Thomas Gleixner:
"Nothing really exciting this time:
- a few fixlets in the NOHZ code
- a new ARM SoC timer abomination. One should expect that we have
enough of them already, but they insist on inventing new ones.
- the usual bunch of ARM SoC timer updates. That feels like herding
cats"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable
clocksource: arm_arch_timer: Enable counter access for 32-bit ARM
clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable
clocksource: sirf: Disable counter before re-setting it
clocksource: cadence_ttc: Add support for 32bit mode
clocksource: tcb_clksrc: Sanitize IRQ request
clocksource: arm_arch_timer: Discard unavailable timers correctly
clocksource: vf_pit_timer: Support shutdown mode
ARM: meson6: clocksource: Add Meson6 timer support
ARM: meson: documentation: Add timer documentation
clocksource: sh_tmu: Document r8a7779 binding
clocksource: sh_mtu2: Document r7s72100 binding
clocksource: sh_cmt: Document SoC specific bindings
timerfd: Remove an always true check
nohz: Avoid tick's double reprogramming in highres mode
nohz: Fix spurious periodic tick behaviour in low-res dynticks mode
Pull timer fixes from Ingo Molnar:
"Main changes:
- Fix the deadlock reported by Dave Jones et al
- Clean up and fix nohz_full interaction with arch abilities
- nohz init code consolidation/cleanup"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
nohz: nohz full depends on irq work self IPI support
nohz: Consolidate nohz full init code
arm64: Tell irq work about self IPI support
arm: Tell irq work about self IPI support
x86: Tell irq work about self IPI support
irq_work: Force raised irq work to run on irq work interrupt
irq_work: Introduce arch_irq_work_has_interrupt()
nohz: Move nohz full init call to tick init
Move the nohz_delay bit from the s390_idle data structure to the
per-cpu flags. Clear the nohz delay flag in __cpu_disable and
remove the cpu hotplug notifier that used to do this.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull networking updates from David Miller:
"Most notable changes in here:
1) By far the biggest accomplishment, thanks to a large range of
contributors, is the addition of multi-send for transmit. This is
the result of discussions back in Chicago, and the hard work of
several individuals.
Now, when the ->ndo_start_xmit() method of a driver sees
skb->xmit_more as true, it can choose to defer the doorbell
telling the driver to start processing the new TX queue entires.
skb->xmit_more means that the generic networking is guaranteed to
call the driver immediately with another SKB to send.
There is logic added to the qdisc layer to dequeue multiple
packets at a time, and the handling mis-predicted offloads in
software is now done with no locks held.
Finally, pktgen is extended to have a "burst" parameter that can
be used to test a multi-send implementation.
Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
virtio_net
Adding support is almost trivial, so export more drivers to
support this optimization soon.
I want to thank, in no particular or implied order, Jesper
Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
David Tat, Hannes Frederic Sowa, and Rusty Russell.
2) PTP and timestamping support in bnx2x, from Michal Kalderon.
3) Allow adjusting the rx_copybreak threshold for a driver via
ethtool, and add rx_copybreak support to enic driver. From
Govindarajulu Varadarajan.
4) Significant enhancements to the generic PHY layer and the bcm7xxx
driver in particular (EEE support, auto power down, etc.) from
Florian Fainelli.
5) Allow raw buffers to be used for flow dissection, allowing drivers
to determine the optimal "linear pull" size for devices that DMA
into pools of pages. The objective is to get exactly the
necessary amount of headers into the linear SKB area pre-pulled,
but no more. The new interface drivers use is eth_get_headlen().
From WANG Cong, with driver conversions (several had their own
by-hand duplicated implementations) by Alexander Duyck and Eric
Dumazet.
6) Support checksumming more smoothly and efficiently for
encapsulations, and add "foo over UDP" facility. From Tom
Herbert.
7) Add Broadcom SF2 switch driver to DSA layer, from Florian
Fainelli.
8) eBPF now can load programs via a system call and has an extensive
testsuite. Alexei Starovoitov and Daniel Borkmann.
9) Major overhaul of the packet scheduler to use RCU in several major
areas such as the classifiers and rate estimators. From John
Fastabend.
10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
Duyck.
11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
Dumazet.
12) Add Datacenter TCP congestion control algorithm support, From
Florian Westphal.
13) Reorganize sk_buff so that __copy_skb_header() is significantly
faster. From Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
netlabel: directly return netlbl_unlabel_genl_init()
net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
net: description of dma_cookie cause make xmldocs warning
cxgb4: clean up a type issue
cxgb4: potential shift wrapping bug
i40e: skb->xmit_more support
net: fs_enet: Add NAPI TX
net: fs_enet: Remove non NAPI RX
r8169:add support for RTL8168EP
net_sched: copy exts->type in tcf_exts_change()
wimax: convert printk to pr_foo()
af_unix: remove 0 assignment on static
ipv6: Do not warn for informational ICMP messages, regardless of type.
Update Intel Ethernet Driver maintainers list
bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
tipc: fix bug in multicast congestion handling
net: better IFF_XMIT_DST_RELEASE support
net/mlx4_en: remove NETDEV_TX_BUSY
3c59x: fix bad split of cpu_to_le32(pci_map_single())
net: bcmgenet: fix Tx ring priority programming
...
The pending nested sleep debugging triggered on the potential stale
TASK_INTERRUPTIBLE in this code.
While there, fix the loop such that we won't revert to a while(1)
yield() 'spin' loop if we ever get a spurious wakeup.
And fix the actual issue by properly terminating the 'wait' loop by
setting TASK_RUNNING.
Link: http://lkml.kernel.org/p/20141008165110.GA14547@worktop.programming.kicks-ass.net
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Highlights include:
Stable fixes:
- fix an NFSv4.1 state renewal regression
- fix open/lock state recovery error handling
- fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
- fix statd when reconnection fails
- Don't wake tasks during connection abort
- Don't start reboot recovery if lease check fails
- fix duplicate proc entries
Features:
- pNFS block driver fixes and clean ups from Christoph
- More code cleanups from Anna
- Improve mmap() writeback performance
- Replace use of PF_TRANS with a more generic mechanism for avoiding
deadlocks in nfs_release_page
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUMpFYAAoJEGcL54qWCgDywHYP/A7XNykwOGhoHVP1Cgr3xqoz
gVhAw97AEMZE8xSNVEGS++pJTe59JVzsIsYAwdHMwePV33l3zyzYorae6N9p7zWF
0xVaNQ4qNLVhbrNLAoB5KA/c3/jMnNjF5t15+8akZad5pt4kXLlhSKjyVpdEEtJE
A0eneXShMYEeLZoOJhpQt5bsw0OZ8YbWWEMjGlDqyeelvV3K1+zfivQOoyX6hS4w
XFkPEDmU7zunE/xFP9ZoUaVdLO0TvOWfEZ7STWoHm7NuWfPQiDb9w1mTnuZbZyka
ssezoGcitzwsjCcQ5e1iKTOoFRIsm/zYXFQgFQL7VFMBU1Tss9Of8047EyDkqcPF
GxctsGg0gQ2FkG7yx7JH7AKpyibOIuByQrQQ916coWSf7K0L4H4Rcky3vryroylP
1e1RI49xu215OTm+dLvlvYCv55bqCrTmaUGImZac18+ixD2eh6MNfW2ubSdxk89L
U2rTFV09Bd52N7IQOGQx1FBEI2ZnIFUV4UaFz7v+rGFxOnk6+WYe+iWyb4wC70Yc
8Jh/gTIQDd5aghql3FTieMOyfEvO6Re4pLMXmqEWMAevicx2t8DwkJriRu6X8Iy2
rlDlBPwu5QmRWC20Dc897f0VajwDtwdeB8puod7nobOWzOfx4FrNqLJ+jR3pmHUk
0otvJytqemXt+zkqqHKK
=/OQi
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix an NFSv4.1 state renewal regression
- fix open/lock state recovery error handling
- fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
- fix statd when reconnection fails
- don't wake tasks during connection abort
- don't start reboot recovery if lease check fails
- fix duplicate proc entries
Features:
- pNFS block driver fixes and clean ups from Christoph
- More code cleanups from Anna
- Improve mmap() writeback performance
- Replace use of PF_TRANS with a more generic mechanism for avoiding
deadlocks in nfs_release_page"
* tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
NFSv4.1: Fix an NFSv4.1 state renewal regression
NFSv4: fix open/lock state recovery error handling
NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
NFS: Fabricate fscache server index key correctly
SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
NFSv3: Fix missing includes of nfs3_fs.h
NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
NFS: avoid waiting at all in nfs_release_page when congested.
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
MM: export page_wakeup functions
SCHED: add some "wait..on_bit...timeout()" interfaces.
NFS: don't use STABLE writes during writeback.
NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
rpc: Add -EPERM processing for xs_udp_send_request()
rpc: return sent and err from xs_sendpages()
lockd: Try to reconnect if statd has moved
SUNRPC: Don't wake tasks during connection abort
Fixing lease renewal
nfs: fix duplicate proc entries
pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
...
- eBPF JIT compiler for arm64
- CPU suspend backend for PSCI (firmware interface) with standard idle
states defined in DT (generic idle driver to be merged via a different
tree)
- Support for CONFIG_DEBUG_SET_MODULE_RONX
- Support for unmapped cpu-release-addr (outside kernel linear mapping)
- set_arch_dma_coherent_ops() implemented and bus notifiers removed
- EFI_STUB improvements when base of DRAM is occupied
- Typos in KGDB macros
- Clean-up to (partially) allow kernel building with LLVM
- Other clean-ups (extern keyword, phys_addr_t usage)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUNB6NAAoJEGvWsS0AyF7x22sP/1qPQvFoY71fSqTZmSY+kfgW
UMXhDFZOd+khD2TPHWptbgBRDElTQjRPHyISv/8ILKwDNoMlUDLlYkp1XPLM/nlB
ea9ou2GX8iktqgM2JF5r4vk1hjH6JqEGOUHyWKZc7ibphTVm3dhg3nWL1A4peOUG
0UyX79kl8BLAaggLSUhjtUz1GMpSNlb6Pc1ForUXaPMayBlOcVoOzh1ir7b5wb3e
IvotUY1gv+opE9uK0QPr1AJSfpCogPEfQ2TSCP8MQZjxkrEz69n0HaFvdy60rwf4
DaJiqBoQ5MSP3Bw+qvoYgyz+tfiPFAvEF+O3YQ5x3LBTteoooriFYH4mL7DsicAs
2WLor/342mHykE0bOc44/gNl8B/xaZNzvO2ezLYrjVGsiY2QHTZ7fXB8arPUvQSS
RUXVfHmcv4qthZjI17rgreBKvsfeFIMighSfvMJnVhGqDSvB8abjiPwZjzqB91Bq
pu5MDitNgR3k3ctwzRaS6JtH2CluVFv97xIS4VaD/hm3JnS5NPeTXFou3Gb3lvon
d/wXOIB3vY8FDMIt+BMCQPzWiU0liZ/sN7p1bsOmkgZ1wLOZ0nmsaHF09PDRGbtA
vifopwaw9qtNlcVrTB/rDBCDaT0Ds/mTYD/a3+ch5CYUeLmQmfW/vBMfq/3gUt65
JdI/nTVXawbl2CpBWw36
=SAfQ
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- eBPF JIT compiler for arm64
- CPU suspend backend for PSCI (firmware interface) with standard idle
states defined in DT (generic idle driver to be merged via a
different tree)
- Support for CONFIG_DEBUG_SET_MODULE_RONX
- Support for unmapped cpu-release-addr (outside kernel linear mapping)
- set_arch_dma_coherent_ops() implemented and bus notifiers removed
- EFI_STUB improvements when base of DRAM is occupied
- Typos in KGDB macros
- Clean-up to (partially) allow kernel building with LLVM
- Other clean-ups (extern keyword, phys_addr_t usage)
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (51 commits)
arm64: Remove unneeded extern keyword
ARM64: make of_device_ids const
arm64: Use phys_addr_t type for physical address
aarch64: filter $x from kallsyms
arm64: Use DMA_ERROR_CODE to denote failed allocation
arm64: Fix typos in KGDB macros
arm64: insn: Add return statements after BUG_ON()
arm64: debug: don't re-enable debug exceptions on return from el1_dbg
Revert "arm64: dmi: Add SMBIOS/DMI support"
arm64: Implement set_arch_dma_coherent_ops() to replace bus notifiers
of: amba: use of_dma_configure for AMBA devices
arm64: dmi: Add SMBIOS/DMI support
arm64: Correct ftrace calls to aarch64_insn_gen_branch_imm()
arm64:mm: initialize max_mapnr using function set_max_mapnr
setup: Move unmask of async interrupts after possible earlycon setup
arm64: LLVMLinux: Fix inline arm64 assembly for use with clang
arm64: pageattr: Correctly adjust unaligned start addresses
net: bpf: arm64: fix module memory leak when JIT image build fails
arm64: add PSCI CPU_SUSPEND based cpu_suspend support
arm64: kernel: introduce cpu_init_idle CPU operation
...
Pull ARM updates from Russell King:
"Included in these updates are:
- Performance optimisation to avoid writing the control register at
every exception.
- Use static inline instead of extern inline in ftrace code.
- Crypto ARM assembly updates for big endian
- Alignment of initrd/.init memory to page sizes when freeing to
ensure that we fully free the regions
- Add gcov support
- A couple of preparatory patches for VDSO support: use
_install_special_mapping, and randomize the sigpage placement above
stack.
- Add L2 ePAPR DT cache properties so that DT can specify the cache
geometry.
- Preparatory patch for FIQ (NMI) kernel C code for things like
spinlock lockup debug. Following on from this are a couple of my
patches cleaning up show_regs() and removing an unused (probably
since 1.x days) do_unexp_fiq() function.
- Use pr_warn() rather than pr_warning().
- A number of cleanups (smp, footbridge, return_address)"
* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (21 commits)
ARM: 8167/1: extend the reserved memory for initrd to be page aligned
ARM: 8168/1: extend __init_end to a page align address
ARM: 8169/1: l2c: parse cache properties from ePAPR definitions
ARM: 8160/1: drop warning about return_address not using unwind tables
ARM: 8161/1: footbridge: select machine dir based on ARCH_FOOTBRIDGE
ARM: 8158/1: LLVMLinux: use static inline in ARM ftrace.h
ARM: 8155/1: place sigpage at a random offset above stack
ARM: 8154/1: use _install_special_mapping for sigpage
ARM: 8153/1: Enable gcov support on the ARM architecture
ARM: Avoid writing to control register on every exception
ARM: 8152/1: Convert pr_warning to pr_warn
ARM: remove unused do_unexp_fiq() function
ARM: remove extraneous newline in show_regs()
ARM: 8150/3: fiq: Replace default FIQ handler
ARM: 8140/1: ep93xx: Enable DEBUG_LL_UART_PL01X
ARM: 8139/1: versatile: Enable DEBUG_LL_UART_PL01X
ARM: 8138/1: drop ISAR0 workaround for B15
ARM: 8136/1: sa1100: add Micro ASIC platform device
ARM: 8131/1: arm/smp: Absorb boot_secondary()
ARM: 8126/1: crypto: enable NEON SHA-384/SHA-512 for big endian
...
1/ Step down as dmaengine maintainer see commit 08223d80df "dmaengine
maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit
7787380336 "net_dma: mark broken"), without reports of performance
regression.
3/ Miscellaneous fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUKDLKAAoJEB7SkWpmfYgC7wwP/iNHqRjf1suMUTBIF3P6Hgbe
VCUwh0IkuujMPDG46WRn6cYzarRxVPLoGaLHLPszgjI6pmGPVv19wqeDOlUxtcmr
0iQWEWv/zqseaAIW+4gj/WYCyMgKil49EUBJKCZCfNmIaad+e0pr8f0uE5yOkHPM
tqWoZERu9A4dlXGr1TjeOZVzdnPrCt92MrLDN6ZZ6tMuJaEc5PauaLxKTeGy5fYj
UB+k1xJQzECbsYfpB+uCVYl5/qPO1rNyuBYS8THCsW+JYmrbbfH2kkF2lo2FaUpO
8Yd50FtzXHKWwAt7BzfIwU2M7x0wRmryrC/xsQi6M+WmVeHYvvHUIpzaA66xRZ5x
fCy3Fu8sEnmnmboAbh2v2c5uTycqRl2xPzbpLAuxglloXIxzi3ckp6ESF/Z4SldH
oxIoEievN7lah3vKgvlHZYcWDzrYr8EKf/EzFe9RqDBQDKtzDzre1H9Uivr387Vm
uFUcGHYG/GXuX47C7EUsMtaSW2UEoR2ytw/HR6CKFPTVXwAzEO6kA9vg0EqL0iIq
2wVLgavlZuwegmaUBgnr+bgVZMvVN7OU7fAIRVe5xNO6itrPKvheSlQthmRiiq9C
uzOu4PS6PexqzHUNPCcJpCsj+lawmCSrE0bxtPzTA/CQInVgWs219V9+W5Gn/0YA
EARN9k6ueX9PZPQrPQLm
=BBBv
-----END PGP SIGNATURE-----
Merge tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine
Pull dmaengine updates from Dan Williams:
"Even though this has fixes marked for -stable, given the size and the
needed conflict resolutions this is 3.18-rc1/merge-window material.
These patches have been languishing in my tree for a long while. The
fact that I do not have the time to do proper/prompt maintenance of
this tree is a primary factor in the decision to step down as
dmaengine maintainer. That and the fact that the bulk of drivers/dma/
activity is going through Vinod these days.
The net_dma removal has not been in -next. It has developed simple
conflicts against mainline and net-next (for-3.18).
Continuing thanks to Vinod for staying on top of drivers/dma/.
Summary:
1/ Step down as dmaengine maintainer see commit 08223d80df
"dmaengine maintainer update"
2/ Removal of net_dma, as it has been marked 'broken' since 3.13
(commit 7787380336 "net_dma: mark broken"), without reports of
performance regression.
3/ Miscellaneous fixes"
* tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
net: make tcp_cleanup_rbuf private
net_dma: revert 'copied_early'
net_dma: simple removal
dmaengine maintainer update
dmatest: prevent memory leakage on error path in thread
ioat: Use time_before_jiffies()
dmaengine: fix xor sources continuation
dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
drivers: dma: Include appropriate header file in dca.c
drivers: dma: Mark functions as static in dma_v3.c
dma: mv_xor: Add DMA API error checks
ioat/dca: Use dev_is_pci() to check whether it is pci device
Cheers,
Rusty.
PS. My virtio-next tree is empty: DaveM took the patches I had. There might
be a virtio-rng starvation fix, but so far it's a bit voodoo so I will
get to that in the next two days or it will wait.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUGFrvAAoJENkgDmzRrbjxOJYQALaZbTumrtX3Mo/FAtzn8d5N
8gxcqk1Mhz4lR1vPWy/YN/H2f23qb/saqLxPar8Wgou3h7N8EqSdwDqJSuvEqhG0
iEXUsNLC7BOsDkLYhdjTfZoW/lsVU/EH4bkZMSxAZI9V64phXhDYfPb5SQgJTECr
Ue6IK4ijW6zdWLstGfg/ixrIeGDUSnyiThF9O2mYVaB1D0QkLDIAZxbjZJgfFfut
PwO33/sEV4pceTpkmxFKl/OiS+obi/VbDixjSCcO+jaBd1pVxH9fhhKREStOhN4z
88z5ADR71RH6so9TQTwIIcgb2Hon5d+3RVMB6CxuvKs9NmHSXDiQyZvG9J/jiSdm
KrPKSiVwGGwJSwxXTm8CDaz6Oj0ibDXBIzv/vYI22sR7u8PmRQFvL3O1VrW+KDnE
yoG75S9DHzSQ1183xFFFTt4FBRm/4XKyVs+F6YqYkchLigrUfQMCGb1cmZyE5y7K
bgNyonu0m/ItoQmekoDgYqvSjwdguaJ35XCW55GrKJ84JDHBaw3SpPdEfjAS8FsH
aT5o2oernvwRG6gsX9858RvB/uo1UKwHv1waDfV4cqNjMm5Ko+Yr6OIdQvBQiq07
cFkVmkrMtEyX19QyIGW3QSbFL1lr3X5cC5glzEeKY941yZbTluSsNuMlMPT1+IMx
NOUbh0aG8B8ZaMZPFNLi
=QzCn
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module update from Rusty Russell:
"Nothing major: support for compressing modules, and auto-tainting
params.
PS. My virtio-next tree is empty: DaveM took the patches I had. There
might be a virtio-rng starvation fix, but so far it's a bit voodoo
so I will get to that in the next two days or it will wait"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
moduleparam: Resolve missing-field-initializer warning
kbuild: handle module compression while running 'make modules_install'.
modinst: wrap long lines in order to enhance cmd_modules_install
modsign: lookup lines ending in .ko in .mod files
modpost: simplify file name generation of *.mod.c files
modpost: reduce visibility of symbols and constify r/o arrays
param: check for tainting before calling set op.
drm/i915: taint the kernel if unsafe module parameters are set
module: add module_param_unsafe and module_param_named_unsafe
module: make it possible to have unsafe, tainting module params
module: rename KERNEL_PARAM_FL_NOARG to avoid confusion
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJUL0J0AAoJEA7Zo9+K/4c9w40P/iMFPfCethdBtPz5rI88CVr2
7yU99TdbEPoRJm+rU4ohvHdB73p2KWINIKvpSThvegvjXbEcKxQkdpVWHsFJZeHS
bZiYmhjxdCBvJGLrYo5IwqH0PrSjokTPzMUekUCk7BkUKNJRaDjfUBHvUmKsinUR
dQL+3KE3edy6W3DL+FOd0QZwSOgmOfEibTWpfmg+n16kFNa75Kg/QLwjYRvtQplP
eElywDZN07IhAeBFqKhKvlKmDSAeqMd8RfoPPo9Ts+reeIrWYjVNbl9ISOqXqy2x
JoLeZQmwSXj/C9Ehr5e+aId2eO8In5xueQfXP8SS8dCC7VLwRbnNgyAQQZEslEBk
QH0GhT6GqTamBdiNI3I+usfs65cEaialXh2afcoLwGS/iGD8MhZ8Dt+m4iyXNxEZ
kT9VA4974mPjJ1g0mDDnYIxNjxF43m+SD5K1sR/XGpMcA8NdqMUmvKNcbePCobVa
WTutIemQqGipNeWE94XwZEbc0B+aWwH7eiZOBMVGhWsHInd7QeTBTbfZlctyBkzf
AswgsFjC5FW05CWK6J1Lf/UI1FD9PmHMKpmQUPED1+7okDTfqGjKjdREWgZSixUt
LIRfWqWEaNpRRBFbDyt0C+F4pBRPLiRDaOyNhwEdtXuVGKRXb1G3qX7nFOJAZo6G
GDTZo9iIRNSfm/M4tJ+n
=2VyW
-----END PGP SIGNATURE-----
Merge tag 'tiny/for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/josh/linux
Pull "tinification" patches from Josh Triplett.
Work on making smaller kernels.
* tag 'tiny/for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/josh/linux:
bloat-o-meter: Ignore syscall aliases SyS_ and compat_SyS_
mm: Support compiling out madvise and fadvise
x86: Support compiling out human-friendly processor feature names
x86: Drop support for /proc files when !CONFIG_PROC_FS
x86, boot: Don't compile early_serial_console.c when !CONFIG_EARLY_PRINTK
x86, boot: Don't compile aslr.c when !CONFIG_RANDOMIZE_BASE
x86, boot: Use the usual -y -n mechanism for objects in vmlinux
x86: Add "make tinyconfig" to configure the tiniest possible kernel
x86, platform, kconfig: move kvmconfig functionality to a helper
* pm-genirq:
PM / genirq: Document rules related to system suspend and interrupts
PCI / PM: Make PCIe PME interrupts wake up from suspend-to-idle
x86 / PM: Set IRQCHIP_SKIP_SET_WAKE for IOAPIC IRQ chip objects
genirq: Simplify wakeup mechanism
genirq: Mark wakeup sources as armed on suspend
genirq: Create helper for flow handler entry check
genirq: Distangle edge handler entry
genirq: Avoid double loop on suspend
genirq: Move MASK_ON_SUSPEND handling into suspend_device_irqs()
genirq: Make use of pm misfeature accounting
genirq: Add sanity checks for PM options on shared interrupt lines
genirq: Move suspend/resume logic into irq/pm code
PM / sleep: Mechanism for aborting system suspends unconditionally
Tidy up and use cond_resched_rcu_qs when calling cond_resched and
reporting potential quiescent state to RCU. Splitting this change in
this way allows easy backporting to -stable for kernel versions not
having cond_resched_rcu_qs().
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Similar to the stop_machine deadlock scenario on !PREEMPT kernels
addressed in b22ce2785d "workqueue: cond_resched() after processing
each work item", kworker threads requeueing back-to-back with zero jiffy
delay can stall RCU. The cond_resched call introduced in that fix will
yield only iff there are other higher priority tasks to run, so force a
quiescent RCU state between work items.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Link: https://lkml.kernel.org/r/20140926105227.01325697@jlaw-desktop.mno.stratus.com
Link: https://lkml.kernel.org/r/20140929115445.40221d8e@jlaw-desktop.mno.stratus.com
Fixes: b22ce2785d ("workqueue: cond_resched() after processing each work item")
Cc: <stable@vger.kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
often in the ring buffer. At first I thought it had to do with some
of the changes I was working on, but then testing something else I
realized that the bug was in 3.17 itself. I ran several bisects as the
bug was not very reproducible, and finally came up with the commit
that I could reproduce easily within a few minutes, and without the change
I could run the tests over an hour without issue. The change fit the
bug and I figured out a fix. That bad commit was:
Commit 651e22f270 "ring-buffer: Always reset iterator to reader page"
This commit fixed a bug, but in the process created another one. It used
the wrong value as the cached value that is used to see if things changed
while an iterator was in use. This made it look like a change always
happened, and could cause the iterator to go into an infinite loop.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJULvrkAAoJEKQekfcNnQGuD+sH/iHPE2qb2ojCP9+hqOpszdd1
d8rN8BNZsDlJxfWQELw2vGVXTmeW7txW5DFWQ3I8qSSjwYa6l27M4mHsw2QLagtw
kIrcazis3IAcYCH8OE4ruD5nAGYLFqRIt0MOa/NAJD0r00xM7nvOhII2+6uAXF+A
1JbQDRq8eleCKMUMV0XchqWx6pYTXL8cLh1YEXZ0BTUFKIz+y22HjWnMf+odDhLB
okQic67/+i7mJDAAW4U+pyevd0QBZdDOohjQtbj+irv2pb7WtWqylKcYhAYSpgsy
MtPzzYyPDs/aHLNcnIJVdVtbKfNXsaHuCgEvKKgLXnKMMcS5UxSIxj+Q1IxSIOM=
=B7HS
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull trace ring buffer iterator fix from Steven Rostedt:
"While testing some new changes for 3.18, I kept hitting a bug every so
often in the ring buffer. At first I thought it had to do with some
of the changes I was working on, but then testing something else I
realized that the bug was in 3.17 itself. I ran several bisects as
the bug was not very reproducible, and finally came up with the commit
that I could reproduce easily within a few minutes, and without the
change I could run the tests over an hour without issue. The change
fit the bug and I figured out a fix. That bad commit was:
Commit 651e22f270 "ring-buffer: Always reset iterator to reader page"
This commit fixed a bug, but in the process created another one. It
used the wrong value as the cached value that is used to see if things
changed while an iterator was in use. This made it look like a change
always happened, and could cause the iterator to go into an infinite
loop"
* tag 'trace-fixes-v3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Fix infinite spin in reading buffer
Commit f0bab73cb5 ("locking/lockdep: Restrict the use of recursive
read_lock() with qrwlock") changed lockdep to try and conform to the
qrwlock semantics which differ from the traditional rwlock semantics.
In particular qrwlock is fair outside of interrupt context, but in
interrupt context readers will ignore all fairness.
The problem modeling this is that read and write side have different
lock state (interrupts) semantics but we only have a single
representation of these. Therefore lockdep will get confused, thinking
the lock can cause interrupt lock inversions.
So revert it for now; the old rwlock semantics were already imperfectly
modeled and the qrwlock extra won't fit either.
If we want to properly fix this, I think we need to resurrect the work
by Gautham did a few years ago that split the read and write state of
locks:
http://lwn.net/Articles/332801/
FWIW the locking selftest that would've failed (and was reported by
Borislav earlier) is something like:
RL(X1); /* IRQ-ON */
LOCK(A);
UNLOCK(A);
RU(X1);
IRQ_ENTER();
RL(X1); /* IN-IRQ */
RU(X1);
IRQ_EXIT();
At which point it would report that because A is an IRQ-unsafe lock we
can suffer the following inversion:
CPU0 CPU1
lock(A)
lock(X1)
lock(A)
<IRQ>
lock(X1)
And this is 'wrong' because X1 can recurse (assuming the above lock are
in fact read-lock) but lockdep doesn't know about this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: ego@linux.vnet.ibm.com
Cc: bp@alien8.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20140930132600.GA7444@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 9b0fc9c09f ("rwsem: skip initial trylock in rwsem_down_write_failed")
checks for if there are known active lockers in order to avoid write trylocking
using expensive cmpxchg() when it likely wouldn't get the lock.
However, a subsequent patch was added such that we directly
check for sem->count == RWSEM_WAITING_BIAS right before trying
that cmpxchg().
Thus, commit 9b0fc9c09f now just adds overhead.
This patch modifies it so that we only do a check for if
count == RWSEM_WAITING_BIAS.
Also, add a comment on why we do an "extra check" of count
before the cmpxchg().
Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1410913017.2447.22.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rq->rd is freed using call_rcu_sched(), so rcu_read_lock() to access it
is not enough. We should use either rcu_read_lock_sched() or preempt_disable().
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Fixes: 66339c31bc "sched: Use dl_bw_of() under RCU read lock"
Link: http://lkml.kernel.org/r/1412065417.20287.24.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
if this is necessary.
Furthermore, a higher priority class task may be current on dest rq,
we shouldn't disturb it.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On 32 bit systems cmpxchg cannot handle 64 bit values, so
some additional magic is required to allow a 32 bit system
with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build.
Make sure the correct cmpxchg function is used when doing
an atomic swap of a cputime_t.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Cc: oleg@redhat.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linux390@de.ibm.com
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit caeb178c60 ("sched/fair: Make update_sd_pick_busiest() ...")
sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
but is only more loaded than others. This change has been introduced to ensure
a better load balance in system that are not overloaded but as a side effect,
it can also generate useless active migration between groups.
Let take the example of 3 tasks on a quad cores system. We will always have an
idle core so the load balance will find a busiest group (core) whenever an ILB
is triggered and it will force an active migration (once above
nr_balance_failed threshold) so the idle core becomes busy but another core
will become idle. With the next ILB, the freshly idle core will try to pull the
task of a busy CPU.
The number of spurious active migration is not so huge in quad core system
because the ILB is not triggered so much. But it becomes significant as soon as
you have more than one sched_domain level like on a dual cluster of quad cores
where the ILB is triggered every tick when you have more than 1 busy_cpu
We need to ensure that the migration generate a real improveùent and will not
only move the avg_load imbalance on another CPU.
Before caeb178c60, the filtering of such use
case was ensured by the following test in f_b_g:
if ((local->idle_cpus < busiest->idle_cpus) &&
busiest->sum_nr_running <= busiest->group_weight)
This patch modified the condition to take into account situation where busiest
group is not overloaded: If the diff between the number of idle cpus in 2
groups is less than or equal to 1 and the busiest group is not overloaded,
moving a task will not improve the load balance but just move it.
A test with sysbench on a dual clusters of quad cores gives the following
results:
command: sysbench --test=cpu --num-threads=5 --max-time=5 run
The HZ is 200 which means that 1000 ticks has fired during the test.
With Mainline, perf gives the following figures:
Samples: 727 of event 'sched:sched_migrate_task'
Event count (approx.): 727
Overhead Command Shared Object Symbol
........ ............... ............. ..............
12.52% migration/1 [unknown] [.] 00000000
12.52% migration/5 [unknown] [.] 00000000
12.52% migration/7 [unknown] [.] 00000000
12.10% migration/6 [unknown] [.] 00000000
11.83% migration/0 [unknown] [.] 00000000
11.83% migration/3 [unknown] [.] 00000000
11.14% migration/4 [unknown] [.] 00000000
10.87% migration/2 [unknown] [.] 00000000
2.75% sysbench [unknown] [.] 00000000
0.83% swapper [unknown] [.] 00000000
0.55% ktps65090charge [unknown] [.] 00000000
0.41% mmcqd/1 [unknown] [.] 00000000
0.14% perf [unknown] [.] 00000000
With this patch, perf gives the following figures
Samples: 20 of event 'sched:sched_migrate_task'
Event count (approx.): 20
Overhead Command Shared Object Symbol
........ ............... ............. ..............
80.00% sysbench [unknown] [.] 00000000
10.00% swapper [unknown] [.] 00000000
5.00% ktps65090charge [unknown] [.] 00000000
5.00% migration/1 [unknown] [.] 00000000
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
calling perf_event_free_task() when failing sched_fork() we will not yet
have done the memset() on ->perf_event_ctxp[] and will therefore try and
'free' the inherited contexts, which are still in use by the parent
process.
This is bad and might explain some outstanding fuzzer failures ...
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20140929101201.GE5430@worktop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The idiot who did 4a1c0f262f ("perf: Fix lockdep warning on process exit")
forgot to pay attention and fix all similar cases. Do so now.
In particular, unclone_ctx() must be called while holding ctx->lock,
therefore all such sites are broken for the same reason. Pull the
put_ctx() call out from under ctx->lock.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Probably-also-reported-by: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 4a1c0f262f ("perf: Fix lockdep warning on process exit")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140930172308.GI4241@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
calling perf_event_free_task() when failing sched_fork() we will not yet
have done the memset() on ->perf_event_ctxp[] and will therefore try and
'free' the inherited contexts, which are still in use by the parent
process. This is bad..
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 651e22f270 "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.
A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
its not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.
The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.
Commit 651e22f270 changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match! Which
should never happen as the reset is suppose to set the cache to the
current value and there's locks that keep a consuming reader from
having access to the data.
Fixes: 651e22f270 "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
drivers/net/usb/r8152.c
net/netfilter/nfnetlink.c
Both r8152 and nfnetlink conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to ARM, AArch64 is generating $x and $d syms... which isn't
terribly helpful when looking at %pF output and the like. Filter those
out in kallsyms, modpost and when looking at module symbols.
Seems simplest since none of these check EM_ARM anyway, to just add it
to the strchr used, rather than trying to make things overly
complicated.
initcall_debug improves:
dmesg_before.txt: initcall $x+0x0/0x154 [sg] returned 0 after 26331 usecs
dmesg_after.txt: initcall init_sg+0x0/0x154 [sg] returned 0 after 15461 usecs
Signed-off-by: Kyle McMartin <kyle@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
consider C program represented in eBPF:
int filter(int arg)
{
int a, b, c, *ptr;
if (arg == 1)
ptr = &a;
else if (arg == 2)
ptr = &b;
else
ptr = &c;
*ptr = 0;
return 0;
}
eBPF verifier has to follow all possible paths through the program
to recognize that '*ptr = 0' instruction would be safe to execute
in all situations.
It's doing it by picking a path towards the end and observes changes
to registers and stack at every insn until it reaches bpf_exit.
Then it comes back to one of the previous branches and goes towards
the end again with potentially different values in registers.
When program has a lot of branches, the number of possible combinations
of branches is huge, so verifer has a hard limit of walking no more
than 32k instructions. This limit can be reached and complex (but valid)
programs could be rejected. Therefore it's important to recognize equivalent
verifier states to prune this depth first search.
Basic idea can be illustrated by the program (where .. are some eBPF insns):
1: ..
2: if (rX == rY) goto 4
3: ..
4: ..
5: ..
6: bpf_exit
In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6
Since insn#2 is a branch the verifier will remember its state in verifier stack
to come back to it later.
Since insn#4 is marked as 'branch target', the verifier will remember its state
in explored_states[4] linked list.
Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and
will continue.
Without search pruning optimization verifier would have to walk 4, 5, 6 again,
effectively simulating execution of insns 1, 2, 4, 5, 6
With search pruning it will check whether state at #4 after jumping from #2
is equivalent to one recorded in explored_states[4] during first pass.
If there is an equivalent state, verifier can prune the search at #4 and declare
this path to be safe as well.
In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns
and 1, 2, 4 insns produces equivalent registers and stack.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing implementation of swsusp_free iterates over all
pfns in the system and checks every bit in the two memory
bitmaps.
This doesn't scale very well with large numbers of pfns,
especially when the bitmaps are not populated very densly.
Change the algorithm to iterate over the set bits in the
bitmaps instead to make it scale better in large memory
configurations.
Also add a memory_bm_clear_current() helper function that
clears the bit for the last position returned from the
memory bitmap.
This new version adds a !NULL check for the memory bitmaps
before they are walked. Not doing so causes a kernel crash
when the bitmaps are NULL.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The ACPI GPE wakeup from suspend-to-idle is currently based on using
the IRQF_NO_SUSPEND flag for the ACPI SCI, but that is problematic
for a couple of reasons. First, in principle the ACPI SCI may be
shared and IRQF_NO_SUSPEND does not really work well with shared
interrupts. Second, it may require the ACPI subsystem to special-case
the handling of device notifications depending on whether or not
they are received during suspend-to-idle in some places which would
lead to fragile code. Finally, it's better the handle ACPI wakeup
interrupts consistently with wakeup interrupts from other sources.
For this reason, remove the IRQF_NO_SUSPEND flag from the ACPI SCI
and use enable_irq_wake()/disable_irq_wake() with it instead, which
requires two additional platform hooks to be added to struct
platform_freeze_ops.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Rename several local functions related to platform handling during
system suspend resume in suspend.c so that their names better
reflect their roles.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subsequent change sets will add platform-related operations between
dpm_suspend_late() and dpm_suspend_noirq() as well as between
dpm_resume_noirq() and dpm_resume_early() in suspend_enter(), so
export these functions for suspend_enter() to be able to call them
separately and split the invocations of dpm_suspend_end() and
dpm_resume_start() in there accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Remove some unnecessary ones and explicitly include rwsem.h
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Its quite easy to get mixed up with the names -- 'torture_spinlock_irq'
is not actually a valid spinlock name.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Add a "rw_lock" torture test to stress kernel rwlocks and their irq
variant. Reader critical regions are 5x longer than writers. As such
a similar ratio of lock acquisitions is seen in the statistics. In the
case of massive contention, both hold the lock for 1/10 of a second.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.
This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.
Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():
https://lkml.org/lkml/2014/9/3/177
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull cgroup fixes from Tejun Heo:
"This is quite late but these need to be backported anyway.
This is the fix for a long-standing cpuset bug which existed from
2009. cpuset makes use of PF_SPREAD_{PAGE|SLAB} flags to modify the
task's memory allocation behavior according to the settings of the
cpuset it belongs to; unfortunately, when those flags have to be
changed, cpuset did so directly even whlie the target task is running,
which is obviously racy as task->flags may be modified by the task
itself at any time. This obscure bug manifested as corrupt
PF_USED_MATH flag leading to a weird crash.
The bug is fixed by moving the flag to task->atomic_flags. The first
two are prepatory ones to help defining atomic_flags accessors and the
third one is the actual fix"
* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
sched: add macros to define bitops for task atomic flags
sched: fix confusing PFA_NO_NEW_PRIVS constant
1.
the library includes a trivial set of BPF syscall wrappers:
int bpf_create_map(int key_size, int value_size, int max_entries);
int bpf_update_elem(int fd, void *key, void *value);
int bpf_lookup_elem(int fd, void *key, void *value);
int bpf_delete_elem(int fd, void *key);
int bpf_get_next_key(int fd, void *key, void *next_key);
int bpf_prog_load(enum bpf_prog_type prog_type,
const struct sock_filter_int *insns, int insn_len,
const char *license);
bpf_prog_load() stores verifier log into global bpf_log_buf[] array
and BPF_*() macros to build instructions
2.
test stubs configure eBPF infra with 'unspec' map and program types.
These are fake types used by user space testsuite only.
3.
verifier tests valid and invalid programs and expects predefined
error log messages from kernel.
40 tests so far.
$ sudo ./test_verifier
#0 add+sub+mul OK
#1 unreachable OK
#2 unreachable2 OK
#3 out of range jump OK
#4 out of range jump2 OK
#5 test1 ld_imm64 OK
...
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds verifier core which simulates execution of every insn and
records the state of registers and program stack. Every branch instruction seen
during simulation is pushed into state stack. When verifier reaches BPF_EXIT,
it pops the state from the stack and continues until it reaches BPF_EXIT again.
For program:
1: bpf_mov r1, xxx
2: if (r1 == 0) goto 5
3: bpf_mov r0, 1
4: goto 6
5: bpf_mov r0, 2
6: bpf_exit
The verifier will walk insns: 1, 2, 3, 4, 6
then it will pop the state recorded at insn#2 and will continue: 5, 6
This way it walks all possible paths through the program and checks all
possible values of registers. While doing so, it checks for:
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention
- instruction encoding is not using reserved fields
Kernel subsystem configures the verifier with two callbacks:
- bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
that provides information to the verifer which fields of 'ctx'
are accessible (remember 'ctx' is the first argument to eBPF program)
- const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
returns argument constraints of kernel helper functions that eBPF program
may call, so that verifier can checks that R1-R5 types match the prototype
More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
check that control flow graph of eBPF program is a directed acyclic graph
check_cfg() does:
- detect loops
- detect unreachable instructions
- check that program terminates with BPF_EXIT insn
- check that all branches are within program boundary
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions
to refer to process-local map_fd. Scan the program for such instructions and
if FDs are valid, convert them to 'struct bpf_map' pointers which will be used
by verifier to check access to maps in bpf_map_lookup/update() calls.
If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping
BPF_PSEUDO_MAP_FD flag.
Note that eBPF interpreter is generic and knows nothing about pseudo insns.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add optional attributes for BPF_PROG_LOAD syscall:
union bpf_attr {
struct {
...
__u32 log_level; /* verbosity level of eBPF verifier */
__u32 log_size; /* size of user buffer */
__aligned_u64 log_buf; /* user supplied 'char *buffer' */
};
};
when log_level > 0 the verifier will return its verification log in the user
supplied buffer 'log_buf' which can be used by program author to analyze why
verifier rejected given program.
'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
provides several examples of these messages, like the program:
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
BPF_LD_MAP_FD(BPF_REG_1, 0),
BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
BPF_EXIT_INSN(),
will be rejected with the following multi-line message in log_buf:
0: (7a) *(u64 *)(r10 -8) = 0
1: (bf) r2 = r10
2: (07) r2 += -8
3: (b7) r1 = 0
4: (85) call 1
5: (15) if r0 == 0x0 goto pc+1
R0=map_ptr R10=fp
6: (7a) *(u64 *)(r0 +4) = 0
misaligned access off 4 size 8
The format of the output can change at any time as verifier evolves.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch adds all of eBPF verfier documentation and empty bpf_check()
The end goal for the verifier is to statically check safety of the program.
Verifier will catch:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention
More details in Documentation/networking/filter.txt
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in native eBPF programs userspace is using pseudo BPF_CALL instructions
which encode one of 'enum bpf_func_id' inside insn->imm field.
Verifier checks that program using correct function arguments to given func_id.
If all checks passed, kernel needs to fixup BPF_CALL->imm fields by
replacing func_id with in-kernel function pointer.
eBPF interpreter just calls the function.
In-kernel eBPF users continue to use generic BPF_CALL.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF programs are similar to kernel modules. They are loaded by the user
process and automatically unloaded when process exits. Each eBPF program is
a safe run-to-completion set of instructions. eBPF verifier statically
determines that the program terminates and is safe to execute.
The following syscall wrapper can be used to load the program:
int bpf_prog_load(enum bpf_prog_type prog_type,
const struct bpf_insn *insns, int insn_cnt,
const char *license)
{
union bpf_attr attr = {
.prog_type = prog_type,
.insns = ptr_to_u64(insns),
.insn_cnt = insn_cnt,
.license = ptr_to_u64(license),
};
return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}
where 'insns' is an array of eBPF instructions and 'license' is a string
that must be GPL compatible to call helper functions marked gpl_only
Upon succesful load the syscall returns prog_fd.
Use close(prog_fd) to unload the program.
User space tests and examples follow in the later patches
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
The maps are accessed from user space via BPF syscall, which has commands:
- create a map with given type and attributes
fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
returns fd or negative error
- lookup key in a given map referenced by fd
err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
using attr->map_fd, attr->key, attr->value
returns zero and stores found elem into value or negative error
- create or update key/value pair in a given map
err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
using attr->map_fd, attr->key, attr->value
returns zero or negative error
- find and delete element by key in a given map
err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
using attr->map_fd, attr->key
- iterate map elements (based on input key return next_key)
err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
using attr->map_fd, attr->key, attr->next_key
- close(fd) deletes the map
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
done as separate commit to ease conflict resolution
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF syscall is a multiplexor for a range of different operations on eBPF.
This patch introduces syscall with single command to create a map.
Next patch adds commands to access maps.
'maps' is a generic storage of different types for sharing data between kernel
and userspace.
Userspace example:
/* this syscall wrapper creates a map with given type and attributes
* and returns map_fd on success.
* use close(map_fd) to delete the map
*/
int bpf_create_map(enum bpf_map_type map_type, int key_size,
int value_size, int max_entries)
{
union bpf_attr attr = {
.map_type = map_type,
.key_size = key_size,
.value_size = value_size,
.max_entries = max_entries
};
return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
'union bpf_attr' is backwards compatible with future extensions.
More details in Documentation/networking/filter.txt and in manpage
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable gcov support for ARM based on original patches by David
Singleton and George G. Davis
Riku - updated to patch to current mainline kernel. The patch
has been submitted in 2010, 2012 - for symmetry, now in 2014 too.
https://lwn.net/Articles/390419/http://marc.info/?l=linux-arm-kernel&m=133823081813044
v2: remove arch/arm/kernel from gcov disabled files
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Signed-off-by: Vincent Sanders <vincent.sanders@collabora.co.uk>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Various drivers implement architecture and/or device specific means to
restart (reset) the system. Various mechanisms have been implemented to
support those schemes. The best known mechanism is arm_pm_restart, which
is a function pointer to be set either from platform specific code or from
drivers. Another mechanism is to use hardware watchdogs to issue a reset;
this mechanism is used if there is no other method available to reset a
board or system. Two examples are alim7101_wdt, which currently uses the
reboot notifier to trigger a reset, and moxart_wdt, which registers the
arm_pm_restart function.
The existing mechanisms have a number of drawbacks. Typically only one
scheme to restart the system is supported (at least if arm_pm_restart is
used). At least in theory there can be multiple means to restart the
system, some of which may be less desirable (for example one mechanism may
only reset the CPU, while another may reset the entire system). Using
arm_pm_restart can also be racy if the function pointer is set from a
driver, as the driver may be in the process of being unloaded when
arm_pm_restart is called. Using the reboot notifier is always racy, as it
is unknown if and when other functions using the reboot notifier have
completed execution by the time the watchdog fires.
Introduce a system restart handler call chain to solve the described
problems. This call chain is expected to be executed from the
architecture specific machine_restart() function. Drivers providing
system restart functionality (such as the watchdog drivers mentioned
above) are expected to register with this call chain. By using the
priority field in the notifier block, callers can control restart handler
execution sequence and thus ensure that the restart handler with the
optimal restart capabilities for a given system is called first.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Heiko Stuebner <heiko@sntech.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jonas Jensen <jonas.jensen@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Tomasz Figa <t.figa@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 0c7bf3e8ca.
If there are child cgroups in the cgroupfs and then we umount it,
the superblock will be destroyed but the cgroup_root will be kept
around. When we mount it again, cgroup_mount() will find this
cgroup_root and allocate a new sb for it.
So with this commit we will be trapped in a dead loop in the case
described above, because kernfs_pin_sb() keeps returning NULL.
Currently I don't see how we can avoid using both pinned_sb and
new_sb, so just revert it.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Revert of a recent hibernation core commit that introduced
a NULL pointer dereference during resume for at least one user
(Rafael J Wysocki).
- Fix for the ACPI LPSS (Low-Power Subsystem) driver to disable
asynchronous PM callback execution for LPSS devices during system
suspend/resume (introduced in 3.16) which turns out to break
ordering expectations on some systems. From Fu Zhonghui.
- cpufreq core fix related to the handling of sysfs nodes during
system suspend/resume that has been broken for intel_pstate
since 3.15 from Lan Tianyu.
- Restore the generation of "online" uevents for ACPI container
devices that was removed in 3.14, but some user space utilities
turn out to need them (Rafael J Wysocki).
- The cpufreq core fails to release a lock in an error code path
after changes made in 3.14. Fix from Prarit Bhargava.
- ACPICA and ACPI/GPIO fixes to make the handling of ACPI GPIO
operation regions (which means AML using GPIOs) work correctly
in all cases from Bob Moore and Srinivas Pandruvada.
- Fix for a wrong sign of the ACPI core's create_modalias() return
value in case of an error from Mika Westerberg.
- ACPI backlight blacklist entry for ThinkPad X201s from Aaron Lu.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUJJGgAAoJEILEb/54YlRxt3kP/19OjVjGK/lFKJk4LCmQ77k5
6DDF7/clNJmYBkKBXGdyqqRVdDUXjRuHS1Yd78zWMmwdLtdOcyI+wBjG1w0mMU7o
vAYvXkIks9fCeKBRHSlqdtQROFf3+bxothKD8JGTONA5z4Fih40fqsnuSW8G7uJs
iTEQQK7L2uPJ+w1OnltwN6eNgzN5KqfxgxI+L6DhEMRjWXRHuhfRZorVIjvz+ALV
Fjm8shhjnhQKzS2zuv5PZ5gGM7zZBH7hy7kd4aDYsbppOLAB2pMOwVs0sgC1Xcbv
teyWkyzmhix2Z1bX9wwia5FfMgbnY2leejJN7mukKzHz8CQ1vxS98Sji2uviIAej
Ctp6GKjuemGvjryjbkstD6r3KYS8CuWAL++YwlamqSa0eWBuM+aD9YqGj4i6ntbU
8BFT5KXauOIsA5U51zC8wNUDHoTgBcvoN99zNIM1jIF81M7wuQrXUzJLXBStuSlR
/bDpExwxHt7I6MeUfRTjg37ApVNRAiStw32+DfsKAj4HLsqTkGs1879Paxf30T0f
Z2SlYr5Jeusu5u9DNhk7MG21A+m46R0jjLd1OKBbf2mrtfQfdKCo6szGR7vjEMZC
aGIlwtIA4iS4MN3UAyqOW3SxIPT2SxqPXzG/z27hRN5MUsGNWiClzcUsaaHoHmpp
GlbY/BvDYfur4NBeCSli
=SzQq
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are regression fixes (ACPI hotplug, cpufreq, hibernation, ACPI
LPSS driver), fixes for stuff that never worked correctly (ACPI GPIO
support in some cases and a wrong sign of an error code in the ACPI
core in one place), and one blacklist item for ACPI backlight
handling.
Specifics:
- Revert of a recent hibernation core commit that introduced a NULL
pointer dereference during resume for at least one user (Rafael J
Wysocki).
- Fix for the ACPI LPSS (Low-Power Subsystem) driver to disable
asynchronous PM callback execution for LPSS devices during system
suspend/resume (introduced in 3.16) which turns out to break
ordering expectations on some systems. From Fu Zhonghui.
- cpufreq core fix related to the handling of sysfs nodes during
system suspend/resume that has been broken for intel_pstate since
3.15 from Lan Tianyu.
- Restore the generation of "online" uevents for ACPI container
devices that was removed in 3.14, but some user space utilities
turn out to need them (Rafael J Wysocki).
- The cpufreq core fails to release a lock in an error code path
after changes made in 3.14. Fix from Prarit Bhargava.
- ACPICA and ACPI/GPIO fixes to make the handling of ACPI GPIO
operation regions (which means AML using GPIOs) work correctly in
all cases from Bob Moore and Srinivas Pandruvada.
- Fix for a wrong sign of the ACPI core's create_modalias() return
value in case of an error from Mika Westerberg.
- ACPI backlight blacklist entry for ThinkPad X201s from Aaron Lu"
* tag 'pm+acpi-3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"
gpio / ACPI: Use pin index and bit length
ACPICA: Update to GPIO region handler interface.
ACPI / platform / LPSS: disable async suspend/resume of LPSS devices
cpufreq: release policy->rwsem on error
cpufreq: fix cpufreq suspend/resume for intel_pstate
ACPI / scan: Correct error return value of create_modalias()
ACPI / video: disable native backlight for ThinkPad X201s
ACPI / hotplug: Generate online uevents for ACPI containers
In commit c1221321b7
sched: Allow wait_on_bit_action() functions to support a timeout
I suggested that a "wait_on_bit_timeout()" interface would not meet my
need. This isn't true - I was just over-engineering.
Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization. If some other
use is ever found, it can be generalized or added later.
So this patch renames "private" to "timeout" with a meaning "stop
waiting when "jiffies" reaches or passes "timeout",
and adds two of the many possible wait..bit..timeout() interfaces:
wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example. Others can be added as needed.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.
Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.
Here's the full report:
https://lkml.org/lkml/2014/9/19/230
To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Fixes: 950592f7b9 ("cpusets: update tasks' page/slab spread flags in time")
Cc: <stable@vger.kernel.org> # 2.6.31+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Also adds a class type PM_QOS_SUM that aggregates the values by summing them.
It can be used by memory controllers to calculate the optimum clock frequency
based on the bandwidth needs of the different memory clients.
Signed-off-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.
Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.
This patch adds @flags to percpu_ref_init() and implements the
following flags.
* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained
These flags should be able to serve the above two use cases.
v2: target_core_tpg.c conversion was missing. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
This reverts commit 1f9a7268c6.
With the fix of the initial state for the cloned event we now correctly
handle the error described in:
1f9a7268c6 perf: Do not allow optimized switch for non-cloned events
so we can revert it.
I made an automated test for this, but its not suitable for automated
perf tests framework. It needs to be customized for each machine (the
more cpu the higher numbers for GROUPS/WORKERS/BYTES) and it could take
longer time to hit the issue.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140910143535.GD2409@krava.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we initialize the child event based on the original
parent state. This is wrong, because the original parent event
(and its state) is not related to current fork and also could
be already gone.
We need to initialize the child state based on the immediate
parent event state.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1410520708-19275-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we return POLLHUP in event polling if the monitored
process is done, but we didn't consider possible children,
that might be still running and producing data.
Before returning POLLHUP making sure that:
1) the monitored task has exited and that
2) we don't have any children to monitor
Also adding parent wakeup when the child event is gone.
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410520708-19275-1-git-send-email-jolsa@kernel.org
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some time ago PREEMPT_NEED_RESCHED was implemented,
so reschedule technics is a little more difficult now.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183642.11015.66039.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Probability of use-after-free isn't zero in this place.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nothing is locked there, so label's name only confuses a reader.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140922183630.11015.59500.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
dl_bw_of() dereferences rq->rd which has to have RCU read lock held.
Probability of use-after-free isn't zero here.
Also add lockdep assert into dl_bw_cpus().
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
read_lock_irqsave(tasklist_lock) in print_rq() looks strange. We do
not need to disable irqs, and they are already disabled by the caller.
And afaics this lock buys nothing, we can rely on rcu_read_lock().
In this case it makes sense to also move rcu_read_lock/unlock from
the caller to print_rq().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921193341.GA28628@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. read_lock(tasklist_lock) does not need to disable irqs.
2. ->mm != NULL is a common mistake, use PF_KTHREAD.
3. The second ->mm check can be simply removed.
4. task_rq_lock() looks better than raw_spin_lock(&p->pi_lock) +
__task_rq_lock().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921193338.GA28621@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tg_has_rt_tasks() wants to find an RT task in this task_group, but
task_rq(p)->rt.tg wrongly checks the root rt_rq.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: http://lkml.kernel.org/r/20140921193336.GA28618@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The code in find_idlest_cpu() looks for the CPU with the smallest load.
However, if multiple CPUs are idle, the first idle CPU is selected
irrespective of the depth of its idle state.
Among the idle CPUs we should pick the one with with the shallowest idle
state, or the latest to have gone idle if all idle CPUs are in the same
state. The later applies even when cpuidle is configured out.
This patch doesn't cover the following issues:
- The idle exit latency of a CPU might be larger than the time needed
to migrate the waking task to an already running CPU with sufficient
capacity, and therefore performance would benefit from task packing
in such case (in most cases task packing is about power saving).
- Some idle states have a non negligible and non abortable entry latency
which needs to run to completion before the exit latency can start.
A concurrent patch series is making this info available to the cpuidle
core. Once available, the entry latency with the idle timestamp could
determine when the exit latency may be effective.
Those issues will be handled in due course. In the mean time, what
is implemented here should improve things already compared to the current
state of affairs.
Based on an initial patch from Daniel Lezcano.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the cpu enters idle, it stores the cpuidle state pointer in its
struct rq instance which in turn could be used to make a better decision
when balancing tasks.
As soon as the cpu exits its idle state, the struct rq reference is
cleared.
There are a couple of situations where the idle state pointer could be changed
while it is being consulted:
1. For x86/acpi with dynamic c-states, when a laptop switches from battery
to AC that could result on removing the deeper idle state. The acpi driver
triggers:
'acpi_processor_cst_has_changed'
'cpuidle_pause_and_lock'
'cpuidle_uninstall_idle_handler'
'kick_all_cpus_sync'.
All cpus will exit their idle state and the pointed object will be set to
NULL.
2. The cpuidle driver is unloaded. Logically that could happen but not
in practice because the drivers are always compiled in and 95% of them are
not coded to unregister themselves. In any case, the unloading code must
call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
leading to 'kick_all_cpus_sync' as mentioned above.
A race can happen if we use the pointer and then one of these two scenarios
occurs at the same moment.
In order to be safe, the idle state pointer stored in the rq must be
used inside a rcu_read_lock section where we are protected with the
'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
idle_get_state() and idle_put_state() accessors should be used to that
effect.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Users can perform clustered scheduling using the cpuset facility.
After an exclusive cpuset is created, task migrations happen only
between CPUs belonging to the same cpuset. Inter- cpuset migrations
can only happen when the user requires so, moving a task between
different cpusets. This behaviour is broken in SCHED_DEADLINE, as
currently spurious inter- cpuset migration may happen without user
intervention.
This patch fix the problem (and shuffles the code a bit to improve
clarity).
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: raistlin@linux.it
Cc: michael@amarulasolutions.com
Cc: fchecconi@gmail.com
Cc: daniel.wagner@bmw-carit.de
Cc: vincent@legout.info
Cc: luca.abeni@unitn.it
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task is using SCHED_DEADLINE and the user setschedules it to a
different class its sched_dl_entity static parameters are not cleaned
up. This causes a bug if the user sets it back to SCHED_DEADLINE with
the same parameters again. The problem resides in the check we
perform at the very beginning of dl_overflow():
if (new_bw == p->dl.dl_bw)
return 0;
This condition is met in the case depicted above, so the function
returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not
added to it). After this, admission control is broken.
This patch fixes the thing, properly clearing static parameters for a
task that ceases to use SCHED_DEADLINE.
Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Tested-by: Luca Abeni <luca.abeni@unitn.it>
Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Tested-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
current->state == TASK_DEAD means that the task is doing its
last schedule(), page fault is obviously impossible at this
stage.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140921194743.GA30114@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When task->comm is passed directly to audit_log_untrustedstring() without
getting a copy or using the task_lock, there is a race that could happen that
would output a NULL (\0) in the output string that would effectively truncate
the rest of the report text after the comm= field in the audit, losing fields.
Use get_task_comm() to get a copy while acquiring the task_lock to prevent
this and to prevent the result from being a mixture of old and new values of
comm.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
When an AUDIT_GET_FEATURE message is sent from userspace to the kernel, it
should reply with a message tagged as an AUDIT_GET_FEATURE type with a struct
audit_feature. The current reply is a message tagged as an AUDIT_GET
type with a struct audit_feature.
This appears to have been a cut-and-paste-eo in commit b0fed40.
Reported-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Report:
Looking at your example code in
http://people.redhat.com/rbriggs/audit-multicast-listen/audit-multicast-listen.c,
it seems that nlmsg_len field in the received messages is supposed to
contain the length of the header + payload, but it is always set to the
size of the header only, i.e. 16. The example program works, because
the printf format specifies the minimum width, not "precision", so it
simply prints out the payload until the first zero byte. This isn't too
much of a problem, but precludes the use of recvmmsg, iiuc?
(gdb) p *(struct nlmsghdr*)nlh
$14 = {nlmsg_len = 16, nlmsg_type = 1100, nlmsg_flags = 0, nlmsg_seq = 0, nlmsg_pid = 9910}
The only time nlmsg_len would have been updated was at audit_buffer_alloc()
inside audit_log_start() and never updated after. It should arguably be done
in audit_log_vformat(), but would be more efficient in audit_log_end().
Reported-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Since only one of val, uid, gid and lsm* are used at any given time, combine
them to reduce the size of the struct audit_field.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Various audit events dealing with adding, removing and updating rules result in
invalid values set for the op keys which result in embedded spaces in op=
values.
The invalid values are
op="add rule" set in kernel/auditfilter.c
op="remove rule" set in kernel/auditfilter.c
op="remove rule" set in kernel/audit_tree.c
op="updated rules" set in kernel/audit_watch.c
op="remove rule" set in kernel/audit_watch.c
Replace the space in the above values with an underscore character ('_').
Coded-by: Burn Alting <burn@swtf.dyndns.org>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Since there is already a primitive to do this operation in the atomic_t, use it
to simplify audit_serial().
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Use kernel.h definition.
Cc: Eric Paris <eparis@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Since the arch is found locally in __audit_syscall_entry(), there is no need to
pass it in as a parameter. Delete it from the parameter list.
x86* was the only arch to call __audit_syscall_entry() directly and did so from
assembly code.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-audit@redhat.com
Signed-off-by: Eric Paris <eparis@redhat.com>
---
As this patch relies on changes in the audit tree, I think it
appropriate to send it through my tree rather than the x86 tree.
The AUDIT_SECCOMP record looks something like this:
type=SECCOMP msg=audit(1373478171.953:32775): auid=4325 uid=4325 gid=4325 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0 pid=12381 comm="test" sig=31 syscall=231 compat=0 ip=0x39ea8bca89 code=0x0
In order to determine what syscall 231 maps to, we need to have the arch= field right before it.
To see the event, compile this test.c program:
=====
int main(void)
{
return seccomp_load(seccomp_init(SCMP_ACT_KILL));
}
=====
gcc -g test.c -o test -lseccomp
After running the program, find the record by: ausearch --start recent -m SECCOMP -i
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
signed-off-by: Eric Paris <eparis@redhat.com>
Since every arch should have syscall_get_arch() defined, stop using the
function argument and just collect this ourselves. We do not drop the
argument as fixing some code paths (in assembly) to not pass this first
argument is non-trivial. The argument will be dropped when that is
fixed.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c
Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup fix from Tejun Heo:
"One late fix for cgroup.
I was waiting for another set of fixes for a long-standing obscure
cpuset bug but am not sure whether they'll be ready before v3.17
release. This one is a simple fix for a mutex unlock balance bug in
an allocation failure path in pidlist_array_load().
The bug was introduced in v3.14 and the fix is tagged for -stable"
* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix unbalanced locking
This patch moves Exynos PM domain code to use the new generic PM domain
look-up framework introduced in previous patches, thus also allowing
the new code to be compiled with CONFIG_ARCH_EXYNOS.
This patch was originally submitted by Tomasz Figa when he was employed
by Samsung.
Link: http://marc.info/?l=linux-pm&m=139955336002083&w=2
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Kevin Hilman <khilman@linaro.org>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch introduces generic code to perform PM domain look-up using
device tree and automatically bind devices to their PM domains.
Generic device tree bindings are introduced to specify PM domains of
devices in their device tree nodes.
Backwards compatibility with legacy Samsung-specific PM domain bindings
is provided, but for now the new code is not compiled when
CONFIG_ARCH_EXYNOS is selected to avoid collision with legacy code.
This will change as soon as the Exynos PM domain code gets converted to
use the generic framework in further patch.
This patch was originally submitted by Tomasz Figa when he was employed
by Samsung.
Link: http://marc.info/?l=linux-pm&m=139955349702152&w=2
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Tested-by: Philipp Zabel <p.zabel@pengutronix.de>
Reviewed-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch adds another suspend_resume trace event for analyze_suspend
to capture. The resume_console call can take several hundred milliseconds
if the printk buffer is full of debug info. The tool will now inform
testers of the wasted time and encourage them to disable it in
production builds.
Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Both pinned_sb and new_sb indicate if a new superblock is needed,
so we can just remove new_sb.
Note now we must check if kernfs_tryget_sb() returns NULL, because
when it returns NULL, kernfs_mount() may still re-use an existing
superblock, which is just allocated by another concurent mount.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The patch 971ff49355: "cgroup: use a per-cgroup work for release
agent" from Sep 18, 2014, leads to the following static checker
warning:
kernel/cgroup.c:5310 cgroup_release_agent()
warn: 'mutex:&cgroup_mutex' is sometimes locked here and sometimes unlocked.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull perf fixes from Ingo Molnar:
"Two kernel side fixes: a kprobes fix and a perf_remove_from_context()
fix (which does not yet fix the migration bug which is WIP)"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix a race condition in perf_remove_from_context()
kprobes/x86: Free 'optinsn' cache when range check fails
We call put_css_set() after setting CGRP_RELEASABLE flag in
cgroup_task_migrate(), but in other places we call it without setting
the flag. I don't see the necessity of this flag.
Moreover once the flag is set, it will never be cleared, unless writing
to the notify_on_release control file, so it can be quite confusing
if we look at the output of debug.releasable.
# mount -t cgroup -o debug xxx /cgroup
# mkdir /cgroup/child
# cat /cgroup/child/debug.releasable
0 <-- shows 0 though the cgroup is empty
# echo $$ > /cgroup/child/tasks
# cat /cgroup/child/debug.releasable
0
# echo $$ > /cgroup/tasks && echo $$ > /cgroup/child/tasks
# cat /proc/child/debug.releasable
1 <-- shows 1 though the cgroup is not empty
This patch removes the flag, and now debug.releasable shows if the
cgroup is empty or not.
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>